toml-rs / toml

Rust TOML Parser
https://docs.rs/toml
Apache License 2.0

Spanned values don't play well with dotted notation: getting "invalid type: string [..], expected a borrowed string" #798

Open yannham opened 2 months ago

yannham commented 2 months ago

I'm using Spanned to deserialize TOML to Nickel (a configuration language) while preserving spans as much as possible, as Nickel adds validation capabilities and we'd like to link back validation errors to the precise piece of TOML data that failed.

To do so, we define a bespoke data structure that is more or less like a TOML value but with Spanned appropriately sprinkled, and write a custom deserializer using serde_untagged. You can find the type definition and the deserializer here: https://github.com/tweag/nickel/blob/927ee23993747b7851e51bcfe3eb3e685ba4ebb1/core/src/serialize.rs#L491-L582

However, when deserializing the following file:

[foo.bar]
baz = "qux"

This gives the following surprising error:

error: toml parse error: TOML parse error at line 1, column 1
  |
1 | [foo.bar]
  | ^
invalid type: string "bar", expected a borrowed string

in `foo`

It's surprising because we never try to deserialize a borrowed string: all strings, both as terminal values and keys, are owned in SpannedValue (NickelString is a simple wrapper around String). Also, any TOML file without dotted notation is parsed fine. After some experimentation, it seems that this happens when trying to deserialize the value (and not the key) of the outer map, that is, the value associated with foo.

I suspect that there are some shenanigans around getting the location of the nested map {bar = {baz = "qux"}}. It seems that the spanned deserializer of toml-rs tries to deserialize markers as borrowed strings (https://github.com/toml-rs/toml/blob/b05e8c489be8ebfc0acacc1ec3556d95cd8d2198/crates/serde_spanned/src/spanned.rs#L161), but it also expects a very precise structure, so I'm not entirely sure what's going on here.
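For context on the wording of the error: in serde, a `&str` target only accepts a string that the deserializer can lend out for the whole input lifetime (`visit_borrowed_str`), while an owned or temporary string arrives through `visit_str` and is rejected. The toy model below sketches that dispatch using only the standard library. The trait and every name in it are hypothetical stand-ins, not serde's or toml-rs's actual code, and whether this is the exact path taken in this issue is an assumption.

```rust
// Toy model of visitor dispatch for strings. Hypothetical names, stdlib only.
trait Visitor<'de>: Sized {
    type Value;

    // Description used in error messages.
    fn expecting(&self) -> &'static str;

    // Called with a string that only lives for the duration of the call.
    fn visit_str(self, v: &str) -> Result<Self::Value, String> {
        Err(format!("invalid type: string {:?}, expected {}", v, self.expecting()))
    }

    // Called with a string borrowed from the original input.
    // By default it falls back to the transient case.
    fn visit_borrowed_str(self, v: &'de str) -> Result<Self::Value, String> {
        self.visit_str(v)
    }
}

// Models a `&str` target: it can only keep a string borrowed from the input.
struct BorrowedStr;

impl<'de> Visitor<'de> for BorrowedStr {
    type Value = &'de str;

    fn expecting(&self) -> &'static str {
        "a borrowed string"
    }

    fn visit_borrowed_str(self, v: &'de str) -> Result<&'de str, String> {
        Ok(v)
    }
}

// Models a `String` target: it copies, so any string source is acceptable.
struct OwnedString;

impl<'de> Visitor<'de> for OwnedString {
    type Value = String;

    fn expecting(&self) -> &'static str {
        "a string"
    }

    fn visit_str(self, v: &str) -> Result<String, String> {
        Ok(v.to_owned())
    }
}

// A deserializer that can hand out slices of the original input.
fn from_input<'de, V: Visitor<'de>>(input: &'de str, visitor: V) -> Result<V::Value, String> {
    visitor.visit_borrowed_str(input)
}

// A deserializer that only has a temporary, owned copy of the key.
fn from_transient<'de, V: Visitor<'de>>(key: String, visitor: V) -> Result<V::Value, String> {
    visitor.visit_str(&key)
}

fn main() {
    let input = String::from("bar");
    // Borrowing from the input satisfies the borrowed-string visitor.
    println!("{:?}", from_input(&input, BorrowedStr));
    // An owned target accepts a transient string.
    println!("{:?}", from_transient(input.clone(), OwnedString));
    // A borrowed target rejects a transient string with an error.
    println!("{:?}", from_transient(input.clone(), BorrowedStr));
}
```

In this model, the last call fails with a message of the same shape as the one reported above, which is consistent with a marker lookup receiving the ordinary key "bar" as a non-borrowed string.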

The issue is that I don't see any easy work-around: once we've tried to deserialize the content of a map as spanned (which is entirely legit for files that don't have the dotted notation), there doesn't seem to be any way to retry the same deserialization at a different type.

yannham commented 2 months ago

(For the record, the explicitly nested version foo = {bar = {baz = "qux"}} parses correctly, so it really seems to be around dotted field syntax)
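As far as the data itself goes, the two spellings describe the same nested tables. The sketch below is a toy model of that equivalence using only the standard library. The `Value` type and the `table_at` helper are hypothetical and have nothing to do with toml-rs internals or with how it tracks spans.

```rust
use std::collections::BTreeMap;

// Toy value model: strings and tables only. Hypothetical, not toml-rs's type.
#[derive(Debug, PartialEq)]
enum Value {
    String(String),
    Table(Table),
}

type Table = BTreeMap<String, Value>;

// Follows `path` from `root`, creating intermediate tables as needed,
// and returns the innermost table. This mimics a `[a.b]` header.
fn table_at<'a>(root: &'a mut Table, path: &[&str]) -> &'a mut Table {
    match path.split_first() {
        None => root,
        Some((head, rest)) => {
            let child = root
                .entry(head.to_string())
                .or_insert_with(|| Value::Table(Table::new()));
            match child {
                Value::Table(table) => table_at(table, rest),
                Value::String(_) => panic!("key `{}` is not a table", head),
            }
        }
    }
}

// Models:
//   [foo.bar]
//   baz = "qux"
fn dotted_form() -> Table {
    let mut root = Table::new();
    table_at(&mut root, &["foo", "bar"])
        .insert("baz".to_string(), Value::String("qux".to_string()));
    root
}

// Models: foo = {bar = {baz = "qux"}}
fn nested_form() -> Table {
    let mut bar = Table::new();
    bar.insert("baz".to_string(), Value::String("qux".to_string()));
    let mut foo = Table::new();
    foo.insert("bar".to_string(), Value::Table(bar));
    let mut root = Table::new();
    root.insert("foo".to_string(), Value::Table(foo));
    root
}

fn main() {
    // Both spellings produce the same tree in this model.
    assert_eq!(dotted_form(), nested_form());
    println!("{:?}", dotted_form());
}
```

In the dotted form, the intermediate tables are created implicitly by the header and never written out as values of their own, which is the one structural difference between the two inputs in this model.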

epage commented 2 months ago

Do you have an isolated reproduction case for this?

yannham commented 2 months ago

I can try to make one, yes.

jneem commented 1 month ago

Here's a smallish reproduction, which crashes with "invalid type: string \"bar\", expected a borrowed string". Removing the Spanned on line 16 makes it succeed.

#!/usr/bin/env -S cargo -Zscript
---cargo
[dependencies]
serde = { version = "1", features = ["derive"] }
serde-untagged = "0.1.6"
toml = "0.8.19"
---

use serde_untagged::UntaggedEnumVisitor;
use serde::de::{Deserializer, MapAccess};
use toml::Spanned;

#[derive(Debug)]
pub enum SpannedValue {
    String(String),
    Map(Vec<(String, Spanned<SpannedValue>)>)
}

impl<'de> serde::Deserialize<'de> for SpannedValue {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: Deserializer<'de>,
    {
        let data = UntaggedEnumVisitor::new()
            .string(|str| Ok(SpannedValue::String(str.into())))
            .map(|mut map| {
                let mut result = Vec::new();

                while let Some((k, v)) = map.next_entry()? {
                    result.push((k, v));
                }

                Ok(SpannedValue::Map(result))
            })
            .deserialize(deserializer)?;

        Ok(data)
    }
}

const INPUT: &str = r#"
[foo.bar]
baz = "qux"
"#;

fn main() {
    let val: SpannedValue = toml::from_str(INPUT).unwrap();
    dbg!(val);
}

epage commented 1 month ago

Thanks for the reproduction case!

We have tests for Spanned being used in arrays, keys, and values, but not in recursive data structures like this. It appears that untagged enums, whether using serde_untagged or #[serde(untagged)], aren't supported at this time.

serde is a bit of a mess to dig into to support cases like this. I personally will likely not get to this for a bit but would be happy with any help on this.

yannham commented 1 month ago

In the meantime, if anyone has the same issue and is looking for a way out, our current work-around is to use the lower-level toml-edit crate, which works: https://github.com/tweag/nickel/pull/2074.