langchain-ai / langsmith-sdk

LangSmith Client SDK Implementations
https://docs.smith.langchain.com/
MIT License

python[patch]: evaluators can return primitives #1203

Closed baskaryan closed 1 week ago

baskaryan commented 1 week ago
from langsmith import evaluate

def foo(run, example):
    return 0

def bar(run, example):
    return "long"

# removed, needs to be list of dict
# def baz(run, example):
#     return [0, 0.2, "how are ya"]

def app(inputs):
    return inputs

# baz is commented out above, so it is not passed here (it would raise a NameError)
evaluate(app, data="Sample Dataset 3", evaluators=[foo, bar])
baskaryan commented 1 week ago

how ^ example gets logged

[Screenshot, 2024-11-11 10:38:17 AM]
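
For readers without the screenshot, here is a minimal sketch of one way primitive returns could be turned into feedback. It assumes numeric and boolean values map to a `score` field, strings map to a `value` field, and the evaluator's function name is used as the feedback key. `coerce_primitive` is a hypothetical helper written for illustration, not the SDK's actual code.

```python
from typing import Any, Dict, Union

Primitive = Union[bool, int, float, str]


def coerce_primitive(result: Primitive, key: str) -> Dict[str, Any]:
    """Hypothetical helper: wrap a primitive evaluator return in a feedback dict."""
    # Assumption: bool, int and float are all treated as a numeric score.
    if isinstance(result, (bool, int, float)):
        return {"key": key, "score": result}
    # Assumption: strings are treated as a categorical value.
    if isinstance(result, str):
        return {"key": key, "value": result}
    raise TypeError(f"Unsupported evaluator return type: {type(result).__name__}")


def foo(run, example):
    return 0


def bar(run, example):
    return "long"


# The evaluators ignore their arguments, so None stands in for run and example.
foo_feedback = coerce_primitive(foo(None, None), key=foo.__name__)
bar_feedback = coerce_primitive(bar(None, None), key=bar.__name__)
```

Under these assumptions `foo` would be logged as a score under the key `foo`, and `bar` as a value under the key `bar`.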

hinthornw commented 1 week ago

The list return type seems unexpected to me - I wouldn't expect it to make separate keys for each value

baskaryan commented 1 week ago

> The list return type seems unexpected to me - I wouldn't expect it to make separate keys for each value

are we able to support list values as feedback?

hinthornw commented 1 week ago

> > The list return type seems unexpected to me - I wouldn't expect it to make separate keys for each value
>
> are we able to support list values as feedback?

Only dict or string it seems.

Maybe doing the _ix suffix is preferable. I'm not sure honestly. Probably less surprising than just logging duplicates to the same key.

It just messes with the experiment averages (we'd be averaging over index 1 and over index 2)
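
The averaging concern can be sketched as follows. The example assumes a list return is expanded into keys of the form `<name>_<index>` (the exact suffix format is an assumption) and that experiment averages are computed per feedback key. `expand_with_index` and `experiment_averages` are hypothetical helpers, not part of the SDK.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def expand_with_index(key: str, scores: List[float]) -> Dict[str, float]:
    """Hypothetical: give each list element its own feedback key, e.g. baz_0, baz_1."""
    return {f"{key}_{ix}": score for ix, score in enumerate(scores)}


def experiment_averages(per_example: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each feedback key across all examples in an experiment."""
    collected = defaultdict(list)
    for feedback in per_example:
        for key, score in feedback.items():
            collected[key].append(score)
    return {key: mean(values) for key, values in collected.items()}


# Two examples whose evaluator returned lists of different lengths.
rows = [
    expand_with_index("baz", [0.0, 1.0]),
    expand_with_index("baz", [1.0, 1.0, 0.5]),
]
averages = experiment_averages(rows)
```

Each position gets its own average (`baz_0`, `baz_1`, `baz_2`), and `baz_2` is averaged over a single example, so the per-key numbers are only meaningful if list positions mean the same thing on every example.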

baskaryan commented 1 week ago

> > > The list return type seems unexpected to me - I wouldn't expect it to make separate keys for each value
>
> > are we able to support list values as feedback?
>
> Only dict or string it seems.
>
> Maybe doing the _ix suffix is preferable. I'm not sure honestly. Probably less surprising than just logging duplicates to the same key.
>
> It just messes with the experiment averages (we'd be averaging over index 1 and over index 2)

I feel less strongly about the list behavior either way; I think the main use case is supporting int/float/bool/str

hinthornw commented 1 week ago

Maybe let's land with the numeric and string value support but hold off on list behavior?

Maybe add support for list[evaluationresultlike] to make the "results" key not necessary
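
A rough sketch of what that second suggestion could look like: an evaluator returns a bare list of result dicts, and a normalization step wraps it under a "results" key so both return styles end up in the same shape. `EvaluationResultLike` here is a plain-dict stand-in and `normalize` is a hypothetical helper; neither is taken from the SDK.

```python
from typing import Any, Dict, List, Union

# Stand-in type for illustration: a dict carrying a key plus a score or value.
EvaluationResultLike = Dict[str, Any]


def normalize(
    output: Union[Dict[str, Any], List[EvaluationResultLike]]
) -> Dict[str, Any]:
    """Hypothetical: accept a bare list of result dicts and wrap it under "results"."""
    if isinstance(output, list):
        # Lists of primitives are deliberately rejected in this sketch.
        if not all(isinstance(item, dict) for item in output):
            raise TypeError("Expected a list of dicts")
        return {"results": output}
    return output


def with_results_key(run, example):
    return {
        "results": [
            {"key": "precision", "score": 0.8},
            {"key": "recall", "score": 0.6},
        ]
    }


def bare_list(run, example):
    return [
        {"key": "precision", "score": 0.8},
        {"key": "recall", "score": 0.6},
    ]


# Both return styles normalize to the same structure.
same = normalize(with_results_key(None, None)) == normalize(bare_list(None, None))
```

Because each dict names its own key, this form avoids the index-suffix question raised earlier in the thread.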