Open ldorigo opened 5 months ago
Actually, the filtering is more broken than I thought: it looks like modifications I make to the input/output dict are actually reflected in the objects I use in my code? So if I delete a key in the langsmith inputs dict, it's no longer present when I try to access it in my code?
You're entirely in control of not mutating your values.
The output is supposed to be whatever you want sent over, so you'd do:
```python
def my_filter(inputs: dict):
    my_input_object = inputs["input"]
    as_dict = my_input_object.dict(exclude={"a_very_secret_number"})
    return {"input": as_dict}
```
It's meant to be treated functionally. Mutating stuff that's passed by reference is often kinda wonky in general.
If we did it after serialization, it would be a lot harder to reason about, because you don't get to use your own types.
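To make the point about mutation concrete, here is a toy demonstration (plain dicts only, no langsmith involved): a filter that mutates its argument also changes the caller's object, while one that builds a new dict leaves the original intact.

```python
def mutating_filter(inputs: dict) -> dict:
    del inputs["secret"]  # mutates the caller's dict in place
    return inputs

def functional_filter(inputs: dict) -> dict:
    # build a new dict; the caller's object is untouched
    return {k: v for k, v in inputs.items() if k != "secret"}

a = {"query": "hi", "secret": 42}
filtered = functional_filter(a)
print("secret" in a, "secret" in filtered)  # True False

b = {"query": "hi", "secret": 42}
mutating_filter(b)
print("secret" in b)  # False - the caller's dict lost the key too
```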
So there are two issues here (which I jumbled together, my bad).
I still think it would be much easier to implement filtering on the final "serialized" dicts representing the objects being sent to langsmith - that makes it a lot easier to just traverse the dicts looking for keys we want to remove. The example you showed above feels a bit hacky - then I might have some objects serialized one way (e.g. `pydantic_object.dict(exclude=...)` like you showed), and others serialized whichever way langsmith ends up using. But OK, this is more of an opinion than a bug - I can work around it.
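For what it's worth, the post-serialization traversal being suggested is easy to sketch in stdlib-only Python (a hypothetical helper, not anything in the SDK):

```python
def strip_keys(obj, banned=frozenset({"a_very_secret_number"})):
    """Recursively drop banned keys from an already-serialized, JSON-like value."""
    if isinstance(obj, dict):
        return {k: strip_keys(v, banned) for k, v in obj.items() if k not in banned}
    if isinstance(obj, list):
        return [strip_keys(v, banned) for v in obj]
    return obj  # scalars pass through unchanged

data = {"input": {"question": "hi",
                  "a_very_secret_number": 7,
                  "nested": [{"a_very_secret_number": 8}]}}
print(strip_keys(data))
# {'input': {'question': 'hi', 'nested': [{}]}}
```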
A few comments:
Hmm - while I see there's friction, I still don't agree with the suggested improvements unless you have a good example. Maybe a good middle ground would be if we exposed our serializer? Then you could always do:
```python
def serialize_inputs(inputs: dict | None):
    langsmith_default = default_serialize(inputs)
    # ... add my own logic
```
For (1), processing it after we serialize it feels like an even worse UX.
Take a standard Python object, for example. If your function uses an object, you could define a serializer on the object:
```python
class Foo:
    __slots__ = ["bar", "baz"]

    def __init__(self):
        self.bar = 5
        self.baz = 6

    def serialize(self):
        # hide baz
        return {"bar": self.bar}
```
So then you could do this:
```python
# Input: {"input": Foo()}
def serialize_inputs(inputs: dict):
    foo = inputs["input"]
    return {"foo": foo.serialize()}
```
Otherwise, you'd get our best effort. In this case it looks OK:
```python
# Input: b'{"bar":5,"baz":6}'
def serialize_inputs(inputs: bytes):
    val = orjson.loads(inputs)
    return {"foo": val}
```
But in other cases it looks weird; for example, our default serialization for the literal `int` type is super verbose:
```
b'{"__new__":"<built-in method __new__ of type object at 0x10131d040>","__repr__":"<slot wrapper \'__repr__\' of \'int\' objects>","__hash__":"<slot wrapper \'__hash__\' of \'int\' objects>","__getattribute__":"<slot wrapper \'__getattribute__\' of \'int\' objects>","__lt__":"<slot wrapper \'__lt__\' of \'int\' objects>","__le__":"<slot wrapper \'__le__\' of \'int\' objects>","__eq__":"<slot wrapper \'__eq__\' of \'int\' objects>","__ne__":"<slot wrapper \'__ne__\' of \'int\' objects>","__gt__":"<slot wrapper \'__gt__\' of \'int\' objects>","__ge__":"<slot wrapper \'__ge__\' of \'int\' objects>","__add__":"<slo...
```
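(For the curious: that blob looks roughly like what you'd get from a best-effort fallback that dumps `repr()` of every entry in the type's attribute table - this is an assumption about the mechanism, not the SDK's actual code.)

```python
import json

# Assumed approximation of the best-effort fallback: stringify every
# attribute in int's type dict and JSON-encode the result.
blob = json.dumps({k: repr(v) for k, v in vars(int).items()})
print(blob[:60])
```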
For (2), we ourselves have to copy your object to serialize it
Sorry to let this hang, had to put out a few other fires.
(1) Indeed, I had resorted to just hijacking your private serializing logic. Maybe supporting that officially would be good. And ideally, you would allow passing custom serializers for specific classes - (yet) another issue I'm facing is that some classes are not serializable, and there's no easy way to tell langsmith to serialize them in a specific way. For instance, I get tons of:
```
[2024-05-22 14:51:48,260: DEBUG] langsmith.client: Failed to serialize <class 'langchain_core.documents.base.Document'> to JSON: Object of type 'DBRef' is not JSON serializable
```
In this case I think they just get serialized using `repr`, but it would still be nice to have control over it.
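One stdlib pattern that would give this kind of per-class control is a `default=` hook on the encoder. The `DBRef` class below is a stand-in for illustration, not the real pymongo type:

```python
import json

class DBRef:  # stand-in for the non-serializable type from the log above
    def __init__(self, collection, oid):
        self.collection = collection
        self.oid = oid

def fallback(obj):
    # per-class serialization rule; anything unknown falls back to repr,
    # mirroring the behavior described above
    if isinstance(obj, DBRef):
        return {"$ref": obj.collection, "$id": obj.oid}
    return repr(obj)

print(json.dumps({"doc": DBRef("users", "abc123")}, default=fallback))
# {"doc": {"$ref": "users", "$id": "abc123"}}
```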
I'm a bit confused by your `int` example; why not use normal `json.dumps` for built-in types? `json.dumps(1)` simply returns `"1"`. Maybe I misunderstood what you're trying to say.
For (2): I think if you take care of copying the object, you could do it in the right place with a lock that prevents other threads from accessing the object while it's being copied? Not really sure of the best way here, it might have some performance impacts.
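A minimal sketch of that copy-under-a-lock idea (hypothetical; for this to actually prevent races, the code that mutates the object would have to take the same lock):

```python
import copy
import threading

_snapshot_lock = threading.Lock()

def snapshot(inputs: dict) -> dict:
    # hold the lock while copying, so a cooperating writer can't mutate
    # the structure mid-copy; writers must also acquire _snapshot_lock
    with _snapshot_lock:
        return copy.deepcopy(inputs)

original = {"query": "hi", "nested": {"secret": 42}}
copied = snapshot(original)
copied["nested"]["secret"] = 0
print(original["nested"]["secret"])  # 42 - the deep copy is independent
```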
Hi @ldorigo thanks for pushing on this! In the most recent version I added additional copying of inputs/outputs before the hide_* methods are called - the values would look like the pre-serialized dictionaries without any of the langsmith serialization logic applied. If you run into atomicity issues then I can roll those changes back
We attempt a deep copy in most cases. However there are some situations/payloads (locks, generators, playwright browser handles, etc.) that can't be deepcopied, so we resort to a depth-restricted recursive copy.
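A sketch of what such a depth-restricted fallback might look like (an illustration of the idea, not the SDK's actual implementation):

```python
import copy
import threading

def limited_copy(obj, depth=4):
    """Depth-restricted copy: recurse into containers up to `depth` levels;
    anything that refuses to be deep-copied (locks, generators, handles, ...)
    is kept as the original reference."""
    if depth <= 0:
        return obj
    if isinstance(obj, dict):
        return {k: limited_copy(v, depth - 1) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(limited_copy(v, depth - 1) for v in obj)
    try:
        return copy.deepcopy(obj)
    except Exception:
        return obj  # uncopyable: fall back to the reference

payload = {"nums": [1, 2], "lock": threading.Lock()}
copied = limited_copy(payload)
print(copied["nums"] is payload["nums"])   # False - copied
print(copied["lock"] is payload["lock"])   # True  - kept by reference
```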
Will keep this issue open until we confirm things are working as intended.
Ah fun - the DBRef objects are coming from a MongoDB integration I take it?
One other thing we could look into doing is some API for letting you annotate or process how to serialize things on a function/runnable object level.
For the SDK i was thinking something like this: https://github.com/langchain-ai/langsmith-sdk/discussions/739
But for langchain core it would probably have to look a bit different
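To make the function-level idea concrete, a hypothetical sketch of what such an annotation could look like (invented API, nothing here exists in langsmith today):

```python
from functools import wraps

def traced(hide_inputs=None):
    """Hypothetical decorator: attach a per-function input filter that runs
    on a copy of the kwargs before they would be recorded by a tracer."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(**kwargs):
            recorded = hide_inputs(dict(kwargs)) if hide_inputs else dict(kwargs)
            wrapper.last_recorded = recorded  # stand-in for "sent to the tracer"
            return fn(**kwargs)
        return wrapper
    return decorator

@traced(hide_inputs=lambda d: {k: v for k, v in d.items() if k != "secret"})
def answer(question, secret):
    return f"{question} -> {secret}"

answer(question="hi", secret=42)
print(answer.last_recorded)  # {'question': 'hi'}
```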
Issue you'd like to raise.
Currently, when running the client as `LangSmithClient(hide_inputs=myfilter, hide_outputs=myfilter)`, the `myfilter` function is called before the objects are serialized for sending to langsmith. Suppose we have e.g. an object like:
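(The example object did not survive extraction. As a stand-in, here is a frozen dataclass - a hypothetical reconstruction; the original was presumably a Pydantic model with a field named `a_very_secret_number`, per the rest of the thread - which shows the same "deletion throws" behavior:)

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class MyInput:
    question: str
    a_very_secret_number: int

obj = MyInput(question="what is the answer?", a_very_secret_number=42)
try:
    del obj.a_very_secret_number  # fields can't be removed on the live object
except FrozenInstanceError:
    print("cannot delete a field before serialization")
```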
I would like to remove `a_very_secret_number`, but I can't do it while the object is not serialized - Pydantic would throw an error.

Suggestion:
No response