serde-rs / serde

Serialization framework for Rust
https://serde.rs/
Apache License 2.0
9.06k stars 767 forks source link

Allow backtracking in `deserialize_option` method #2712

Open kurtbuilds opened 6 months ago

kurtbuilds commented 6 months ago

I wrote a serde implementation of the X12 file format, which is a enterprise-y text file format somewhat similar to CSVs. A document has a header, footer, and a set of transactions. Transactions contain segments, segments contain elements, and elements can be single-valued or be a composite of multiple values.

Deserializing into a generic Document (which is a vec of segments, which themselves are vecs of elements, which is an enum of values or a composite vec of values) works great. Each segment begins with a "header" element, so a DMG segment has different semantic meaning from a REF segment and also contains different quantities and types of fields, and so on.

Within a document, some segments are required, some can be repeated, so a Vec<DmgSegment>, and some are optional, Option<DmgSegment>.

The Problem

To deserialize into transactions with semantic meaning, ones that contain fields like Vec<DmgSegment>, the deserializer needs to be able to backtrack and stop visiting if it's expecting a different segment (e.g. expects a DmgSegment but encounters a REF header.) For a transaction with an Option<DmgSegment> segment, the visitor don't know if the transaction contains that record until we see that DMG header element.

To implement this, I have deserialize_struct implemented, which calls visit_seq. Because visit_seq can return actual values like Ok(None) for the next element, I can catch an UnexpectedSegment error, and in doing so, implement backtracking:

pub struct BufferDeserializer<'de> {
    pub(crate) buffer: Option<&'de str>,
    pub(crate) formatter: X12Formatter,
    pub(crate) delimiter: u8,
}

impl<'de, 'a> Deserializer<'de> for &'a mut BufferDeserializer<'de> {
    fn deserialize_struct<V>(mut self, name: &'static str, fields: &'static [&'static str], visitor: V) -> Result<V::Value, Self::Error> where V: Visitor<'de> {
        let Some(buf) = self.buffer else {
            return Err(X12DeserializerError::EarlyEndOfDocument);
        };
        let (el, rest) = split_string(buf, self.delimiter as char);
        self.buffer = rest;
        visitor.visit_seq(SizedBufferDeserializer {
            de: &mut BufferDeserializer {
                buffer: Some(el),
                delimiter: self.formatter.next_delimiter(self.delimiter),
                formatter: self.formatter,
            },
            remaining: fields.len(),
        })
    }
}

impl<'de> SeqAccess<'de> for BufferDeserializer<'de> {
    type Error = X12DeserializerError;

    fn next_element_seed<T>(&mut self, seed: T) -> Result<Option<T::Value>, Self::Error> where T: DeserializeSeed<'de> {
        if self.buffer.is_none() {
            return Ok(None);
        }
        match seed.deserialize(&mut *self).map(Some) {
            Ok(value) => Ok(value),
            Err(X12DeserializerError::EarlyEndOfDocument) => Ok(None),
            // this is effectively where backtracking happens
            Err(X12DeserializerError::UnexpectedSegment { .. }) => Ok(None),
            Err(e) => Err(e),
        }
    }
}

The above code works great.

However, in the deserialize_option method, the only way to return the equivalent of Ok(None) is by calling visit_none. But we don't know whether it's Some or None valued until we do the visiting. So this code is broken:

    // This code fails to compile
    fn deserialize_option<V>(self, visitor: V) -> Result<V::Value, Self::Error> where V: Visitor<'de> {
        let mut s = visitor.visit_some(self);
        if s.is_ok() {
            s
        } else {
            // EarlyEndOfDocumentError or a UnexpectedSegment error
            // Error is the next line: we can't call `visit_none` because both visit_none and visit_some take ownership of visitor.
            visitor.visit_none() // Compiler error
        }
    }

I tried using visit_seq from deserialize_option, but that fails with an unexpected type error - apparently serde doesn't treat Option as a seq of length zero or one.

Hacky Solution

I hacked a solution using unsafe, which only works because every visitor I've seen is a ZST. Unfortunately, I don't think there's a way to declare to the compiler that a generic V is ZST.

    // this code actually works
    fn deserialize_option<V>(self, visitor: V) -> Result<V::Value, Self::Error> where V: Visitor<'de> {
        // this only works because all the visitors ive seen are ZSTs
        // if they did have state, this would be a disaster
        let insanely_unsafe_copy_of_visitor = unsafe {
            std::ptr::read(&visitor as *const V)
        };
        let mut s = visitor.visit_some(self);
        if s.is_ok() {
            s
        } else {
            insanely_unsafe_copy_of_visitor.visit_none()
        }
    }

The Question

Is there a better way to solve this problem, not using unsafe? Basically, I need a way to return None from deserialize_option, similar to how I'm able to in SeqAccess::next_element_seed.

The most straight forward & backwards compatible way that I see would be if deserialize_option could call visit_seq - perhaps guaranteeing that it's a seq of max one length. But that feels like more indirection than if deserialize_option could return None somehow.

This issue supercedes #2705.