serde-rs / serde

Serialization framework for Rust
https://serde.rs/
Apache License 2.0
9.16k stars 774 forks source link

Avoid lossy buffering in #[serde(flatten)] #2186

Open sfackler opened 2 years ago

sfackler commented 2 years ago

The Deserialize implementation of a type with a field annotated #[serde(flatten)] currently buffers up all of the unknown map entries in memory, and then deserializes the flattened value from them at the end. This approach has a couple of downsides:

  1. It causes the deserializer to allocate when that would normally not be the case.
  2. It confuses "smart" deserializers - things like serde_ignored_value or serde_json's error position tracking don't work properly.
  3. It can lose out on special casing in deserializers based on hinting.

Fortunately, there's alternate approach that avoids all of these issues! Instead of buffering, we can create a custom MapAccess which intercepts fields that we want to handle in the top-level type, and then returning the remainder to the nested deserialize implementation: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=96546d7ba882b70042268a0b48de6286

jonasbb commented 2 years ago

This is a very interesting idea. I have a couple of remarks / incompatibility observations. Maybe some of them can be resolved.

Currently, the Foo type ultimately decides if a field name is known or unknown. This leads to an incompatibility currently, if deny_unknown_fields is added to Foo. With the normal flatten extra fields will be ignored, since ExtendedFoo allows them. https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=4e249bb1961ff44945606ed6bdcdab7d

I think that can be fixed by adding an Ignore variant to ExtendedFooFields, similar to the current derive. An unknown field error from the seed of ExtendedFooFieldsVisitor must be converted into the Ignore variant, but all other error likely should remain. Inspecting the errors might be complicated, since they are very opaque.

It is unclear to me how this could be extended to more than one flattened field. https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=7451b4980894d2ddda33b7ce3fc0aa9c An additional complication here is the fact, that the same value can be deserialized multiple times. A map visitor (i.e., Value) will not consume the fields from the internal buffer, but a struct visitor (i.e., Sub) will. This leads the second Value to share all fields with the first except those consumed by the Sub visitor.

Output ```text [src/main.rs:27] d = Data { f0: Object({ "x": String( "foo", ), "y": String( "bar", ), "z": String( "baz", ), }), a: 123, f1: Sub { y: "bar", }, b: 456, f2: Object({ "x": String( "foo", ), "z": String( "baz", ), }), } ```
sfackler commented 2 years ago

An unknown field error from the seed of ExtendedFooFieldsVisitor must be converted into the Ignore variant, but all other error likely should remain. Inspecting the errors might be complicated, since they are very opaque.

We should be able to handle this by making a custom error type used by the MapAccess implementation that intercepts unknown field errors and delegates the rest to the inner MapAccess's error type.

It is unclear to me how this could be extended to more than one flattened field.

Wow, I was not aware that was a thing you could do! I think you are stuck buffering there to allow the key value to be deserialized multiple times unfortunately. We could continue using the old logic in that case.

sfackler commented 2 years ago

Hmm - we can't actually filter the errors since the error will terminate the deserialization of the inner value entirely...

EDIT: Ah - we may be able to take an approach something like what FlatStructAccess does of remembering the delegate's field list and checking that in our ExtendedFooFieldsSeed.

sfackler commented 2 years ago

On the other hand, the docs on serde.rs do explicitly state that flatten isn't compatible with deny_unknown_fields: https://serde.rs/field-attrs.html#flatten.

sfackler commented 2 years ago

Here's a working implementation of that approach: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=b9b5b874c6f4c93d6d82c1bed4e6451e. If you did want to allow unknown fields in ExtendedFoo, we'd codegen the extra Ignored variant for ExtendedFooFields and return that instead of an error in ExtendedFooFieldsSeed.

Mingun commented 2 years ago

I think, that using a json is a bad example to demonstrate absence of "lossy deserialization", because JSON contains some information about value type. Try some untyped format, like XML (but there are no good XML serde adapter, so using some toy format would be better)

Lucretiel commented 2 years ago

I coincidentally made a fully working implementation of this this evening before even seeing this thread! You don't need to depend at all on looking for particular deserialize errors / missing_field notifications from the flattened inner type, because you already know ahead of time what the names of the fields are that you're interested in. You can minimize codegen by creating a new, private trait (I called mine KeyCapture):

pub enum KeyOutcome<T> {
    Accepted(T),
    Rejected,
}

pub trait KeyCapture<'de> {
    // Returned from `try_send_key` to communicate state to `send_value`.
    type Token;

    fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result;

    fn try_send_key<E>(&mut self, key: &[u8]) -> Result<KeyOutcome<Self::Token>, E>
    where
        E: de::Error;

    fn send_value<D>(&mut self, token: Self::Token, value: D) -> Result<(), D::Error>
    where
        D: de::Deserializer<'de>;
}

In the struct macro codegen, you implement this trait such that it "rejects" keys that are not part of the struct; these keys can then be transparently forwarded to the type being flattened. Once you have this trait, you can use a series of private, reusable adapter types to deserialize the inner flattened field while simultaneously sending data into the KeyCapture. Some snippets:

// From MapAccess:
    fn next_key_seed<K>(&mut self, mut seed: K) -> Result<Option<K::Value>, Self::Error>
    where
        K: de::DeserializeSeed<'de>,
    {
        // We're extracting a key for the type. Try to send keys to the `KeyCapture`,
        // and the first key that is rejected by the `KeyCapture` are is instead to the
        // `seed`
        loop {
            seed = match self.map.next_key_seed(ExtractKeySeed {
                seed,
                capture: &mut self.capture,
            })? {
                None => {
                    self.drained = true;
                    return Ok(None);
                }
                Some(ExtractKeySeedResult::Deserialized(value)) => return Ok(Some(value)),
                Some(ExtractKeySeedResult::Captured { seed, token }) => {
                    self.map.next_value_seed(ExtractValueSeed {
                        token,
                        capture: &mut self.capture,
                    })?;

                    seed
                }
            };
        }
    }

// From Visitor:
    fn visit_str<E>(self, v: &str) -> Result<Self::Value, E>
    where
        E: de::Error,
    {
        match self.capture.try_send_key(v.as_bytes())? {
            KeyOutcome::Accepted(token) => Ok(ExtractKeySeedResult::Captured {
                seed: self.seed,
                token,
            }),
            KeyOutcome::Rejected => self
                .seed
                .deserialize(v.into_deserializer())
                .map(ExtractKeySeedResult::Deserialized),
        }
    }

Here's what an example codegen might look like for a type called Data2:

struct Data2 {
    key3: String,

    // #[serde(flatten)]
    inner: Data1,
}

impl<'de> de::Deserialize<'de> for Data2 {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: serde::Deserializer<'de>,
    {
        enum Field {
            key3
        }

        #[derive(Default)]
        struct Data2Capture {
            key3: Option<String>,
        }

        impl<'de, 'a> KeyCapture<'de> for &'a mut Data2Capture {
            type Token = Field;

            fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
                write!(formatter, "a Data2 struct")
            }

            #[inline]
            fn try_send_key<E>(&mut self, key: &[u8]) -> Result<KeyOutcome<Field>, E>
            where
                E: de::Error,
            {
                Ok(match key {
                    b"key3" => KeyOutcome::Accepted(Field::key3),
                    _ => KeyOutcome::Rejected,
                })
            }

            #[inline]
            fn send_value<D>(&mut self, token: Self::Token, value: D) -> Result<(), D::Error>
            where
                D: de::Deserializer<'de>,
            {
                match token {
                    Field::key3 => de::Deserialize::deserialize(value).map(|value| {
                        self.key3 = Some(value);
                    }),
                }
            }
        }

        let mut capture = Data2Capture::default();

        let inner = Data1::deserialize(ExtractDeserializer {
            deserializer,
            capture: &mut capture,
        })?;

        let key3 = match capture.key3 {
            Some(key3) => key3,
            None => return Err(de::Error::missing_field("key3")),
        };

        Ok(Self { inner, key3 })
    }

I was additionally able to determine that it is outright impossible (without buffering) to have more than one flattened field. In order to pull that off, you'd need to implement Deserialize for this primitive in a way that spreads the keys out to both fields:

struct MapPair<T1, T2> { pub map1: T1, pub map2: T2 }

You can put together a series of wrapper types and get pretty far, but eventually you end up at this point:

    // We have a pair of visitors, visitor1 and visitor2. We build our own sequence type and 
    // visit it to sequence 1:
    fn next_key_seed<K>(&mut self, mut seed: K) -> Result<Option<K::Value>, Self::Error>
    where
        K: de::DeserializeSeed<'de>
    {
        // `seed` is the receiver for `map1`'s value, and we have a `visitor2` lying around.
        // At this point, we need to loop over the internal sequence type to try to send keys
        // to `visitor2`, and detect an `unrecognized_field` error to feed to the seed.
        // Unfortunately the only interface we have to `visitor2` is `visit_seq`, which must
        // build a completed `map2`. We're at an impasse.
    }

It's basically the same reason that you can't unpack an Iterator of (T1, T2) into a pair of types that are FromIterator<T1>, FromIterator<T2>.

Lucretiel commented 2 years ago

Completed implementation for consideration: https://github.com/Lucretiel/serde-bufferless

Mingun commented 2 years ago

I think it is better to fix this generically, not just for the case with one flattened field, because in that case it creates a friction when users realising that small change in definitions of their types will broke deserialization.

To achieve that, changes only on deserializer side is not enough. I work on this problem, and you can see my unfinished work at https://github.com/serde-rs/serde/compare/master...Mingun:flatten. The actual changes begins from 1071b39e63077789beb9e55af7384462db72a3bd, the other is a prepare work.

The key idea in that #[derive(Deserialize)] implements additional trait DeserializeEmbed on types that could be flattened. This approach solves many problems:

I do not remember the exact status of readiness of this, but I remember that only one of the enum representation options remained to be implemented (seems untagged representation).

Chosen approach has downsides, but their are not so big:

The two scenarios is following:

  1. User implement Deserialize manually and that type is flattened into another type. I think this is very rare case
  2. User try to flatten field of type that cannot be flattened, such of String example above. This is in any case an error in the user code that anyway should be fixed, because currently such types will always deserialized with an error which, obviously, not what the user want. Unfortunately, because DeserializeEmbed implemented only in #[derive(Deserialize)], having only #[derive(Serialize)] on the type still allow you to create an erroneous type. I don't know, is it possible to automatically implement some trait twice from different derives without ambiguity, which is by good required here
RReverser commented 1 year ago

Ha, reinvented the same wheel (deserializing with just one flat field without buffers) recently and only now discovered this thread. I suppose I should try to switch to serde-bufferless linked above instead.

RReverser commented 1 year ago

@Lucretiel Any chance you'd want to add a proc-macro to your repo and publish it to crates.io?

palant commented 5 months ago

I’ve taken a similar approach to @Mingun, this issue cannot really be solved without API changes. The deserializers need to expose what fields they can take, and they need to accept fields individually rather than an entire map. I’ve added a DeserializeMap trait extending Deserialize for map-based types as well as the corresponding MapVisitor trait which actually collects the values:

pub trait DeserializeMap<'de>: Deserialize<'de> {
    type Visitor: MapVisitor<'de, Value = Self>;
    fn visitor() -> Self::Visitor;
}

pub trait MapVisitor<'de> {
    type Value;
    fn accepts_field(field: &str) -> bool;
    fn list_fields(list: &mut Vec<&'static str>);
    fn visit_field<D>(self, field: &str, deserializer: D) -> Result<Self, D::Error>
    where
        Self: Sized,
        D: serde::de::Deserializer<'de>;
    fn finalize<E>(self) -> Result<Self::Value, E>
    where
        E: serde::de::Error;
}

Note: While list_fields method doesn’t follow the zero allocation paradigm, it’s only there for nicer error messages.

The Deserialize trait can be implemented generically (as in: blanket implementation would have been possible if not for foreign types) for any data structure implementing DeserializeMap. And a DeserializeMap implementation can delegate processing of individual keys to other DeserializeMap implementations as needed.

Implementation here: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=ce2619418043d821e8bc67d1fc8609a4

I’ve got derive macros implemented as well but these are specialized for my use case.