kinnison / marked-data

A YAML representation library which explicitly retains provenance data
MIT License
13 stars 6 forks source link

Flattened fields get cast into strings #11

Closed CardboardTurkey closed 5 months ago

CardboardTurkey commented 5 months ago

Suppose I have a struct like:

#[derive(Deserialize)]
struct Outer {
    #[serde(flatten)]
    inner: Inner,
}
#[derive(Deserialize)]
struct Inner {
    num: u8,
}

I should be able to deserialize this yaml into Outer:

num: 10

However, if I try to do so with marked_yaml::from_yaml I get the error:

invalid type: string "10", expected u8

It seems that no matter what type I make Inner::num, marked_yaml will always cast the num value to a string before complaining about type mismatch.

This seems to be an issue with the flattening because

#[derive(Deserialize)]
struct Outer {
    num: u8
}

can (of course) deserialize the yaml fine.

kinnison commented 5 months ago

I have written a regression test and can confirm this bug. I shall now investigate.

kinnison commented 5 months ago

Okay, this comes down to the fact that when the derive for the fields on inner are handled, the generated deserializer uses an internal implementation detail of serde called Content which is also used for untagged enums and some other things. This approach guarantees that the deserializer in marked_yaml isn't given the type hints it needs to work out what's going on since the default for marked_yaml is not not ascribe any semantics to the values of scalars (they are always initially strings). This kind of thing also guarantees Spanned<T> won't work inside flattened structs as well by the looks of things.

I'm not sure that there's a way around this; but I shall investigate more.

kinnison commented 5 months ago

This is definitely wound up in the magical internals of serde - sadly it looks like flatten and untagged enums (I previously only knew about the latter) are entirely reliant on the data format being fully self-describing which marked-yaml is, but only in the sense of every value is a string.

I'd need to put some thought into adding opportunistic deserialisation. Looking at how the number types are handled for deserialisation inside serde, I think that if we could have opportunistic deserialisation (ie. when hitting _any for a scalar, try it as a boolean or a number first) then perhaps we could get further; though you'd still not be able to use Spanned<T> inside flattened or untagged things.

Would that be sufficient? If so, I can try and add such a capability.

CardboardTurkey commented 5 months ago

No unfortunately my particular use case requires flattened spanned fields

kinnison commented 5 months ago

No unfortunately my particular use case requires flattened spanned fields

I'll have a think - it might be that your best bet is to integrate the relevant fields into the Outer struct and then write a trait to deal with multiple structs containing the same "flattened" fields

kinnison commented 5 months ago

Right, basically the answer is "nope, not going to happen, sorry" - this is a fundamental disconnect/gap in the serde data model which probably would require a fancier code generation than standard derive proc macros in order to resolve.

I don't see this as possible within the current constraints. As such I reiterate - manually flatten the fields and add a trait if you need it. Sorry.

palant commented 4 months ago

For reference: manual flattening with serde_yaml::Value as intermediate storage instead of serde::private::de::content::Content (since marked_yaml::Node won’t deserialize) causes the same issue. I’ll need to investigate this issue some more to see whether it can be worked around.

kinnison commented 4 months ago

@palant If you can come up with a way to do so then that'd be amazing; unfortunately I can't really find a way to annotate either of the intermediate formats with the right amount of location data to "maybe" be consumed. It'd be easier if a deserializer could specify what intermediate format to use; then we could write support to just use the nodes as the intermediate format.

palant commented 4 months ago

@kinnison I think I see the issue. This library generally treats scalars as strings. There are methods like as_u32 but that’s merely syntactic sugar for “take that string and try to parse it as an integer.”

Serde on the other hand expects data types to be represented differently. So when serde-yaml encounters 1234 then that’s a number, and a number cannot be a string. If you want a string then you have to write "1234".

When Serde stores data in Content it assumes that each value has only one type. This is inherently incompatible with how marked-yaml treats strings, numbers and booleans as interchangeable. Properly supporting the Serde interface is likely impossible without major changes to type handling in this library.

kinnison commented 4 months ago

Yeah, it'd be easy if we could convince serde to treat the "intermediate" format as Node rather than one of the serde data format; but it is what it is. Oh well. Part of the point of marked-yaml is that it does NOT try and guess what type a value should be. Thanks for looking into this.