Open sfackler opened 2 years ago
This is a very interesting idea. I have a couple of remarks / incompatibility observations. Maybe some of them can be resolved.
Currently, the Foo
type ultimately decides if a field name is known or unknown. This leads to an incompatibility currently, if deny_unknown_fields
is added to Foo
. With the normal flatten
extra fields will be ignored, since ExtendedFoo
allows them.
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=4e249bb1961ff44945606ed6bdcdab7d
I think that can be fixed by adding an Ignore
variant to ExtendedFooFields
, similar to the current derive. An unknown field error from the seed of ExtendedFooFieldsVisitor
must be converted into the Ignore
variant, but all other error likely should remain. Inspecting the errors might be complicated, since they are very opaque.
It is unclear to me how this could be extended to more than one flattened field.
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=7451b4980894d2ddda33b7ce3fc0aa9c
An additional complication here is the fact, that the same value can be deserialized multiple times. A map visitor (i.e., Value
) will not consume the fields from the internal buffer, but a struct visitor (i.e., Sub
) will. This leads the second Value
to share all fields with the first except those consumed by the Sub
visitor.
An unknown field error from the seed of
ExtendedFooFieldsVisitor
must be converted into theIgnore
variant, but all other error likely should remain. Inspecting the errors might be complicated, since they are very opaque.
We should be able to handle this by making a custom error type used by the MapAccess implementation that intercepts unknown field errors and delegates the rest to the inner MapAccess's error type.
It is unclear to me how this could be extended to more than one flattened field.
Wow, I was not aware that was a thing you could do! I think you are stuck buffering there to allow the key value to be deserialized multiple times unfortunately. We could continue using the old logic in that case.
Hmm - we can't actually filter the errors since the error will terminate the deserialization of the inner value entirely...
EDIT: Ah - we may be able to take an approach something like what FlatStructAccess does of remembering the delegate's field list and checking that in our ExtendedFooFieldsSeed.
On the other hand, the docs on serde.rs do explicitly state that flatten
isn't compatible with deny_unknown_fields
: https://serde.rs/field-attrs.html#flatten.
Here's a working implementation of that approach: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=b9b5b874c6f4c93d6d82c1bed4e6451e. If you did want to allow unknown fields in ExtendedFoo
, we'd codegen the extra Ignored
variant for ExtendedFooFields
and return that instead of an error in ExtendedFooFieldsSeed
.
I think, that using a json is a bad example to demonstrate absence of "lossy deserialization", because JSON contains some information about value type. Try some untyped format, like XML (but there are no good XML serde adapter, so using some toy format would be better)
I coincidentally made a fully working implementation of this this evening before even seeing this thread! You don't need to depend at all on looking for particular deserialize errors / missing_field notifications from the flattened inner type, because you already know ahead of time what the names of the fields are that you're interested in. You can minimize codegen by creating a new, private trait (I called mine KeyCapture
):
pub enum KeyOutcome<T> {
Accepted(T),
Rejected,
}
pub trait KeyCapture<'de> {
// Returned from `try_send_key` to communicate state to `send_value`.
type Token;
fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result;
fn try_send_key<E>(&mut self, key: &[u8]) -> Result<KeyOutcome<Self::Token>, E>
where
E: de::Error;
fn send_value<D>(&mut self, token: Self::Token, value: D) -> Result<(), D::Error>
where
D: de::Deserializer<'de>;
}
In the struct macro codegen, you implement this trait such that it "rejects" keys that are not part of the struct; these keys can then be transparently forwarded to the type being flattened. Once you have this trait, you can use a series of private, reusable adapter types to deserialize the inner flattened field while simultaneously sending data into the KeyCapture
. Some snippets:
// From MapAccess:
fn next_key_seed<K>(&mut self, mut seed: K) -> Result<Option<K::Value>, Self::Error>
where
K: de::DeserializeSeed<'de>,
{
// We're extracting a key for the type. Try to send keys to the `KeyCapture`,
// and the first key that is rejected by the `KeyCapture` are is instead to the
// `seed`
loop {
seed = match self.map.next_key_seed(ExtractKeySeed {
seed,
capture: &mut self.capture,
})? {
None => {
self.drained = true;
return Ok(None);
}
Some(ExtractKeySeedResult::Deserialized(value)) => return Ok(Some(value)),
Some(ExtractKeySeedResult::Captured { seed, token }) => {
self.map.next_value_seed(ExtractValueSeed {
token,
capture: &mut self.capture,
})?;
seed
}
};
}
}
// From Visitor:
fn visit_str<E>(self, v: &str) -> Result<Self::Value, E>
where
E: de::Error,
{
match self.capture.try_send_key(v.as_bytes())? {
KeyOutcome::Accepted(token) => Ok(ExtractKeySeedResult::Captured {
seed: self.seed,
token,
}),
KeyOutcome::Rejected => self
.seed
.deserialize(v.into_deserializer())
.map(ExtractKeySeedResult::Deserialized),
}
}
Here's what an example codegen might look like for a type called Data2
:
struct Data2 {
key3: String,
// #[serde(flatten)]
inner: Data1,
}
impl<'de> de::Deserialize<'de> for Data2 {
fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
where
D: serde::Deserializer<'de>,
{
enum Field {
key3
}
#[derive(Default)]
struct Data2Capture {
key3: Option<String>,
}
impl<'de, 'a> KeyCapture<'de> for &'a mut Data2Capture {
type Token = Field;
fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
write!(formatter, "a Data2 struct")
}
#[inline]
fn try_send_key<E>(&mut self, key: &[u8]) -> Result<KeyOutcome<Field>, E>
where
E: de::Error,
{
Ok(match key {
b"key3" => KeyOutcome::Accepted(Field::key3),
_ => KeyOutcome::Rejected,
})
}
#[inline]
fn send_value<D>(&mut self, token: Self::Token, value: D) -> Result<(), D::Error>
where
D: de::Deserializer<'de>,
{
match token {
Field::key3 => de::Deserialize::deserialize(value).map(|value| {
self.key3 = Some(value);
}),
}
}
}
let mut capture = Data2Capture::default();
let inner = Data1::deserialize(ExtractDeserializer {
deserializer,
capture: &mut capture,
})?;
let key3 = match capture.key3 {
Some(key3) => key3,
None => return Err(de::Error::missing_field("key3")),
};
Ok(Self { inner, key3 })
}
I was additionally able to determine that it is outright impossible (without buffering) to have more than one flattened field. In order to pull that off, you'd need to implement Deserialize
for this primitive in a way that spreads the keys out to both fields:
struct MapPair<T1, T2> { pub map1: T1, pub map2: T2 }
You can put together a series of wrapper types and get pretty far, but eventually you end up at this point:
// We have a pair of visitors, visitor1 and visitor2. We build our own sequence type and
// visit it to sequence 1:
fn next_key_seed<K>(&mut self, mut seed: K) -> Result<Option<K::Value>, Self::Error>
where
K: de::DeserializeSeed<'de>
{
// `seed` is the receiver for `map1`'s value, and we have a `visitor2` lying around.
// At this point, we need to loop over the internal sequence type to try to send keys
// to `visitor2`, and detect an `unrecognized_field` error to feed to the seed.
// Unfortunately the only interface we have to `visitor2` is `visit_seq`, which must
// build a completed `map2`. We're at an impasse.
}
It's basically the same reason that you can't unpack an Iterator
of (T1, T2)
into a pair of types that are FromIterator<T1>
, FromIterator<T2>
.
Completed implementation for consideration: https://github.com/Lucretiel/serde-bufferless
I think it is better to fix this generically, not just for the case with one flattened field, because in that case it creates a friction when users realising that small change in definitions of their types will broke deserialization.
To achieve that, changes only on deserializer side is not enough. I work on this problem, and you can see my unfinished work at https://github.com/serde-rs/serde/compare/master...Mingun:flatten. The actual changes begins from 1071b39e63077789beb9e55af7384462db72a3bd, the other is a prepare work.
The key idea in that #[derive(Deserialize)]
implements additional trait DeserializeEmbed
on types that could be flattened. This approach solves many problems:
#[serde(flatten)]
on fields of non-flattanable types, such as String
, because DeserializeEmbed
is not implemented for strings (see eac15662a5e045918e93e64b5a7215fe12699307)Option
in untagged enums: https://github.com/serde-rs/serde/blob/c3ce2c934a9123b5a309256d65cca6b6d19466ec/serde/src/de/mod.rs#L1672-L1679I do not remember the exact status of readiness of this, but I remember that only one of the enum representation options remained to be implemented (seems untagged representation).
Chosen approach has downsides, but their are not so big:
DeserializeEmbed
derivered on operation basis. This is the fundamental restriction, because I trying to make this invasive as minimal as I possible and I almost approached that: except two scenarios, described below the end-user don't need to change in their code nothingThe two scenarios is following:
Deserialize
manually and that type is flattened into another type. I think this is very rare caseString
example above. This is in any case an error in the user code that anyway should be fixed, because currently such types will always deserialized with an error which, obviously, not what the user want. Unfortunately, because DeserializeEmbed
implemented only in #[derive(Deserialize)]
, having only #[derive(Serialize)]
on the type still allow you to create an erroneous type. I don't know, is it possible to automatically implement some trait twice from different derives without ambiguity, which is by good required hereHa, reinvented the same wheel (deserializing with just one flat field without buffers) recently and only now discovered this thread. I suppose I should try to switch to serde-bufferless linked above instead.
@Lucretiel Any chance you'd want to add a proc-macro to your repo and publish it to crates.io?
I’ve taken a similar approach to @Mingun, this issue cannot really be solved without API changes. The deserializers need to expose what fields they can take, and they need to accept fields individually rather than an entire map. I’ve added a DeserializeMap
trait extending Deserialize
for map-based types as well as the corresponding MapVisitor
trait which actually collects the values:
pub trait DeserializeMap<'de>: Deserialize<'de> {
type Visitor: MapVisitor<'de, Value = Self>;
fn visitor() -> Self::Visitor;
}
pub trait MapVisitor<'de> {
type Value;
fn accepts_field(field: &str) -> bool;
fn list_fields(list: &mut Vec<&'static str>);
fn visit_field<D>(self, field: &str, deserializer: D) -> Result<Self, D::Error>
where
Self: Sized,
D: serde::de::Deserializer<'de>;
fn finalize<E>(self) -> Result<Self::Value, E>
where
E: serde::de::Error;
}
Note: While list_fields
method doesn’t follow the zero allocation paradigm, it’s only there for nicer error messages.
The Deserialize
trait can be implemented generically (as in: blanket implementation would have been possible if not for foreign types) for any data structure implementing DeserializeMap
. And a DeserializeMap
implementation can delegate processing of individual keys to other DeserializeMap
implementations as needed.
Implementation here: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=ce2619418043d821e8bc67d1fc8609a4
I’ve got derive macros implemented as well but these are specialized for my use case.
The
Deserialize
implementation of a type with a field annotated#[serde(flatten)]
currently buffers up all of the unknown map entries in memory, and then deserializes the flattened value from them at the end. This approach has a couple of downsides:Fortunately, there's alternate approach that avoids all of these issues! Instead of buffering, we can create a custom
MapAccess
which intercepts fields that we want to handle in the top-level type, and then returning the remainder to the nested deserialize implementation: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=96546d7ba882b70042268a0b48de6286