Open ldorigo opened 5 months ago
Actually, the filtering is more broken than I thought: it looks like modifications I make to the input/output dict are actually reflected in the objects I use in my code? So if I delete a key in the langsmith inputs dict, it's no longer present when I try to access it in my code?
You're entirely in control of not mutating your values.
The output is supposed to be whatever you want sent over, so you'd do:
```python
def my_filter(inputs: dict):
    my_input_object = inputs["input"]
    as_dict = my_input_object.dict(exclude={"a_very_secret_number"})
    return {"input": as_dict}
```
It's meant to be treated functionally. Mutating stuff that's passed by reference is often kinda wonky in general.
If we did it after serialization, it would be a lot harder to reason about, because you don't get to use your own types.
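To make the point about mutation concrete, here is a toy demonstration (plain dicts only, no langsmith involved): a filter that mutates its argument also changes the caller's object, while one that builds a new dict leaves the original intact.

```python
def mutating_filter(inputs: dict) -> dict:
    del inputs["secret"]  # mutates the caller's dict in place
    return inputs

def functional_filter(inputs: dict) -> dict:
    # build a new dict; the caller's object is untouched
    return {k: v for k, v in inputs.items() if k != "secret"}

a = {"query": "hi", "secret": 42}
filtered = functional_filter(a)
print("secret" in a, "secret" in filtered)  # True False

b = {"query": "hi", "secret": 42}
mutating_filter(b)
print("secret" in b)  # False - the caller's dict lost the key too
```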
So there are two issues here (which I jumbled together, my bad).
I still think it would be much easier to implement filtering on the final "serialized" dicts representing the objects being sent to langsmith - that makes it a lot easier to just traverse the dicts looking for keys we want to remove. The example you showed above feels a bit hacky - then I might have some objects serialized one way (e.g. `pydantic_object.dict(exclude=...)` like you showed), and others serialized whichever way langsmith ends up using. But OK, this is more of an opinion than a bug - I can work around it.
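For what it's worth, the post-serialization traversal being suggested is easy to sketch in stdlib-only Python (a hypothetical helper, not anything in the SDK):

```python
def strip_keys(obj, banned=frozenset({"a_very_secret_number"})):
    """Recursively drop banned keys from an already-serialized, JSON-like value."""
    if isinstance(obj, dict):
        return {k: strip_keys(v, banned) for k, v in obj.items() if k not in banned}
    if isinstance(obj, list):
        return [strip_keys(v, banned) for v in obj]
    return obj  # scalars pass through unchanged

data = {"input": {"question": "hi",
                  "a_very_secret_number": 7,
                  "nested": [{"a_very_secret_number": 8}]}}
print(strip_keys(data))
# {'input': {'question': 'hi', 'nested': [{}]}}
```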
A few comments:
Hmm - while I see there's friction, I still don't agree with the suggested improvements unless you have a good example. Maybe a good middle ground would be if we exposed our serializer? Then you could always do:
```python
def serialize_inputs(inputs: dict | None):
    langsmith_default = default_serialize(inputs)
    # ... add my own logic
```
For (1), processing it after we serialize it feels like an even worse UX.
Take a standard Python object, for example. If your function uses an object, you could define a serializer on the object:
```python
class Foo:
    __slots__ = ["bar", "baz"]

    def __init__(self):
        self.bar = 5
        self.baz = 6

    def serialize(self):
        # hide baz
        return {"bar": self.bar}
```
So then you could do this:
```python
# Input: {"input": Foo()}
def serialize_inputs(inputs: dict):
    foo = inputs["input"]
    return {"foo": foo.serialize()}
```
Otherwise, you'd get our best effort. In this case it looks OK:
```python
# Input: b'{"bar":5,"baz":6}'
def serialize_inputs(inputs: bytes):
    val = orjson.loads(inputs)
    return {"foo": val}
```
But in other cases it looks weird; for example, our default serialization for the literal `int` type is super verbose:
```
b'{"__new__":"<built-in method __new__ of type object at 0x10131d040>","__repr__":"<slot wrapper \'__repr__\' of \'int\' objects>","__hash__":"<slot wrapper \'__hash__\' of \'int\' objects>","__getattribute__":"<slot wrapper \'__getattribute__\' of \'int\' objects>","__lt__":"<slot wrapper \'__lt__\' of \'int\' objects>","__le__":"<slot wrapper \'__le__\' of \'int\' objects>","__eq__":"<slot wrapper \'__eq__\' of \'int\' objects>","__ne__":"<slot wrapper \'__ne__\' of \'int\' objects>","__gt__":"<slot wrapper \'__gt__\' of \'int\' objects>","__ge__":"<slot wrapper \'__ge__\' of \'int\' objects>","__add__":"<slo...
```
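(For the curious: that blob looks roughly like what you'd get from a best-effort fallback that dumps `repr()` of every entry in the type's attribute table - this is an assumption about the mechanism, not the SDK's actual code.)

```python
import json

# Assumed approximation of the best-effort fallback: stringify every
# attribute in int's type dict and JSON-encode the result.
blob = json.dumps({k: repr(v) for k, v in vars(int).items()})
print(blob[:60])
```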
For (2), we ourselves have to copy your object to serialize it
Sorry to let this hang, had to put out a few other fires.
(1) Indeed, I had resorted to just hijacking your private serializing logic. Maybe supporting that officially would be good. And ideally, you would allow passing custom serializers for specific classes - (yet) another issue I'm facing is that some classes are not serializable, and there's no easy way to tell langsmith to serialize them in a specific way. For instance, I get tons of:
```
[2024-05-22 14:51:48,260: DEBUG] langsmith.client: Failed to serialize <class 'langchain_core.documents.base.Document'> to JSON: Object of type 'DBRef' is not JSON serializable
```
In this case I think they just get serialized using `repr`, but it would still be nice to have control over it.
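One stdlib pattern that would give this kind of per-class control is a `default=` hook on the encoder. The `DBRef` class below is a stand-in for illustration, not the real pymongo type:

```python
import json

class DBRef:  # stand-in for the non-serializable type from the log above
    def __init__(self, collection, oid):
        self.collection = collection
        self.oid = oid

def fallback(obj):
    # per-class serialization rule; anything unknown falls back to repr,
    # mirroring the behavior described above
    if isinstance(obj, DBRef):
        return {"$ref": obj.collection, "$id": obj.oid}
    return repr(obj)

print(json.dumps({"doc": DBRef("users", "abc123")}, default=fallback))
# {"doc": {"$ref": "users", "$id": "abc123"}}
```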
I'm a bit confused by your `int` example; why not use normal `json.dumps` for built-in types? `json.dumps(1)` simply returns `"1"`. Maybe I misunderstood what you're trying to say.
For (2): I think if you take care of copying the object, you could do it in the right place with a lock that prevents other threads from accessing the object while it's being copied? Not really sure of the best way here, it might have some performance impacts.
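A minimal sketch of that copy-under-a-lock idea (hypothetical; for this to actually prevent races, the code that mutates the object would have to take the same lock):

```python
import copy
import threading

_snapshot_lock = threading.Lock()

def snapshot(inputs: dict) -> dict:
    # hold the lock while copying, so a cooperating writer can't mutate
    # the structure mid-copy; writers must also acquire _snapshot_lock
    with _snapshot_lock:
        return copy.deepcopy(inputs)

original = {"query": "hi", "nested": {"secret": 42}}
copied = snapshot(original)
copied["nested"]["secret"] = 0
print(original["nested"]["secret"])  # 42 - the deep copy is independent
```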
Hi @ldorigo thanks for pushing on this! In the most recent version I added additional copying of inputs/outputs before the hide_* methods are called - the values would look like the pre-serialized dictionaries without any of the langsmith serialization logic applied. If you run into atomicity issues then I can roll those changes back
We attempt a deep copy in most cases. However there are some situations/payloads (locks, generators, playwright browser handles, etc.) that can't be deepcopied, so we resort to a depth-restricted recursive copy.
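A sketch of what such a depth-restricted fallback might look like (an illustration of the idea, not the SDK's actual implementation):

```python
import copy
import threading

def limited_copy(obj, depth=4):
    """Depth-restricted copy: recurse into containers up to `depth` levels;
    anything that refuses to be deep-copied (locks, generators, handles, ...)
    is kept as the original reference."""
    if depth <= 0:
        return obj
    if isinstance(obj, dict):
        return {k: limited_copy(v, depth - 1) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(limited_copy(v, depth - 1) for v in obj)
    try:
        return copy.deepcopy(obj)
    except Exception:
        return obj  # uncopyable: fall back to the reference

payload = {"nums": [1, 2], "lock": threading.Lock()}
copied = limited_copy(payload)
print(copied["nums"] is payload["nums"])   # False - copied
print(copied["lock"] is payload["lock"])   # True  - kept by reference
```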
Will keep this issue open until we confirm things are working as intended.
Ah fun - the DBRef objects are coming from a MongoDB integration I take it?
One other thing we could look into doing is some API for letting you annotate or process how to serialize things on a function/runnable object level.
For the SDK i was thinking something like this: https://github.com/langchain-ai/langsmith-sdk/discussions/739
But for langchain core it would probably have to look a bit different
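To make the function-level idea concrete, a hypothetical sketch of what such an annotation could look like (invented API, nothing here exists in langsmith today):

```python
from functools import wraps

def traced(hide_inputs=None):
    """Hypothetical decorator: attach a per-function input filter that runs
    on a copy of the kwargs before they would be recorded by a tracer."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(**kwargs):
            recorded = hide_inputs(dict(kwargs)) if hide_inputs else dict(kwargs)
            wrapper.last_recorded = recorded  # stand-in for "sent to the tracer"
            return fn(**kwargs)
        return wrapper
    return decorator

@traced(hide_inputs=lambda d: {k: v for k, v in d.items() if k != "secret"})
def answer(question, secret):
    return f"{question} -> {secret}"

answer(question="hi", secret=42)
print(answer.last_recorded)  # {'question': 'hi'}
```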
Issue you'd like to raise.
Currently, when running the client as `LangSmithClient(hide_inputs=myfilter, hide_outputs=myfilter)`, the `myfilter` function is called before the objects are serialized for sending to langsmith. Suppose we have e.g. an object like:
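(The example object did not survive extraction. As a stand-in, here is a frozen dataclass - a hypothetical reconstruction; the original was presumably a Pydantic model with a field named `a_very_secret_number`, per the rest of the thread - which shows the same "deletion throws" behavior:)

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class MyInput:
    question: str
    a_very_secret_number: int

obj = MyInput(question="what is the answer?", a_very_secret_number=42)
try:
    del obj.a_very_secret_number  # fields can't be removed on the live object
except FrozenInstanceError:
    print("cannot delete a field before serialization")
```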
I would like to remove `a_very_secret_number`, but I can't do it while the object is not serialized - Pydantic would throw an error.

Suggestion:
No response