tafia / quick-xml

Rust high performance xml reader and writer
MIT License
1.18k stars, 235 forks

Add ability to deserialize serde types from `Reader` #611

Open ndtoan96 opened 1 year ago

ndtoan96 commented 1 year ago

When working with deeply nested XML, we are usually interested only in a small part of the tree close to the leaf nodes. My idea is to extract the target node as a string and deserialize it with serde, but I can't find a convenient way to do that.

Currently I use read_text to get the inner content of the node and add the start and end tags manually, but the resulting code looks really awkward, especially when the node has many attributes. It would be great to have a method (read_node or something similar) that does this.

By the way, is there any reason why read_text is not implemented for Reader<File>?
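
For illustration, the workaround described above (inner text plus manually re-added tags) looks roughly like the following sketch. It uses only the standard library; `wrap_fragment` and `escape_attr` are hypothetical helper names, not part of quick-xml, and `inner` stands in for whatever read_text returned. The rebuilt string could then be handed to a serde-based deserializer.

```rust
/// Escape the characters that may not appear verbatim inside a
/// double-quoted XML attribute value.
fn escape_attr(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for c in value.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '"' => out.push_str("&quot;"),
            _ => out.push(c),
        }
    }
    out
}

/// Rebuild `<name attr="...">inner</name>` from the pieces a caller holds
/// after reading the start tag and the inner text separately.
/// Assumption: `inner` is already raw XML, so it is not escaped again.
fn wrap_fragment(name: &str, attrs: &[(&str, &str)], inner: &str) -> String {
    let mut xml = String::new();
    xml.push('<');
    xml.push_str(name);
    for (key, value) in attrs {
        xml.push(' ');
        xml.push_str(key);
        xml.push_str("=\"");
        xml.push_str(&escape_attr(value));
        xml.push('"');
    }
    xml.push('>');
    xml.push_str(inner);
    xml.push_str("</");
    xml.push_str(name);
    xml.push('>');
    xml
}
```

Every attribute has to be copied and re-escaped by hand, which is where the awkwardness comes from once a node carries many attributes.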

Mingun commented 1 year ago

Having a deserialize method on Reader that could deserialize a piece of XML into a type using serde, starting from the current position, is definitely a feature I also want, as a counterpart to #610. The implementation, however, is not so simple: the serde deserializer requires some (potentially unbounded) lookahead, so we need to buffer events somewhere.

The possible API could look something like this:

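impl<'a> Reader<&'a [u8]> {
  /// Borrowing reader: the deserialized value can borrow from the input slice
  fn deserialize<T>(&mut self, seed: Event<'a>) -> Result<T, DeError>
  where
    T: Deserialize<'a>,
  { ... }
}

impl<R: Read> Reader<R> {
  /// Buffered reader: the deserialized value borrows from the caller-provided `buffer`
  fn deserialize_into<'de, T>(&mut self, seed: Event<'de>, buffer: &'de mut Vec<u8>) -> Result<T, DeError>
  where
    T: Deserialize<'de>,
  { ... }
}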

The seed here is an event that we have already got from the Reader in a typical read loop and that will likely be part of the type we want to deserialize.

Another possible API (very schematic):

impl<R> Reader<R> {
  fn deserializer<'a>(&mut self, seed: Event<'a>) -> FragmentDeserializer<'a> { ... }
}

struct FragmentDeserializer<'a> { ... }
impl<'a> FragmentDeserializer<'a> {
  fn deserialize<T>(self) -> Result<T, DeError>
  where
    T: Deserialize<'a>,
  { ... }
  fn deserialize_into<'de, T>(self, buffer: &'de mut Vec<u8>) -> Result<T, DeError>
  where
    T: Deserialize<'de>,
  { ... }
}

Another question: in what state should we leave the Reader if deserialization fails? And how should we provide access to events that were consumed during lookahead but not used to deserialize the final type? If we want to call deserialize twice, the second call has to take into account the events that the first call looked ahead. We probably need a more generic API:

impl<R> Reader<R> {
  /// Convert to a reader that can store up to `count` events in the internal buffer
  fn lookahead(self, count: usize) -> LookaheadReader<R> { ... }
}

impl<'de, 'a, R> Deserializer<'de> for &'a mut LookaheadReader<R> { ... }
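
To make the buffering idea concrete, here is a minimal standard-library sketch of a bounded lookahead wrapper over any iterator of events. The `Lookahead` type and its `peek_nth` and `next_event` methods are hypothetical names chosen for illustration; this is not quick-xml's API, and a real LookaheadReader would also have to handle buffer lifetimes and parse errors.

```rust
use std::collections::VecDeque;

/// Wraps an event source and keeps up to `capacity` not-yet-consumed events.
struct Lookahead<I: Iterator> {
    source: I,
    buffer: VecDeque<I::Item>,
    capacity: usize,
}

impl<I: Iterator> Lookahead<I> {
    fn new(source: I, capacity: usize) -> Self {
        Lookahead {
            source,
            buffer: VecDeque::with_capacity(capacity),
            capacity,
        }
    }

    /// Look at the event `n` positions ahead without consuming it.
    /// Returns `None` if `n` is outside the capacity or the source runs out.
    fn peek_nth(&mut self, n: usize) -> Option<&I::Item> {
        if n >= self.capacity {
            return None;
        }
        // Pull events from the source until position `n` is buffered.
        while self.buffer.len() <= n {
            let event = self.source.next()?;
            self.buffer.push_back(event);
        }
        self.buffer.get(n)
    }

    /// Consume the next event: buffered events first, then the source.
    fn next_event(&mut self) -> Option<I::Item> {
        match self.buffer.pop_front() {
            Some(event) => Some(event),
            None => self.source.next(),
        }
    }
}
```

Events that were peeked but not consumed stay in the buffer, so a later read (or a second deserialize call) still sees them, which is the situation raised in the questions above.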

Mingun commented 1 year ago

By the way, is there any reason why read_text is not implemented for Reader<File>?

It is not trivial to do that, because we cannot just reuse the read_to_end_into method: it stores only the content of the tags in the buffer and skips the markup characters (<, > and so on). The attempts to implement it are tracked in #483.

tstenner commented 1 month ago

I would also like this. Go makes it easy to mix pull-based parsing with a state machine and deserializing structs:

    decoder := xml.NewDecoder(r.Body)
    decoder.Strict = true
    for {
        // Pull the next token; stops on io.EOF at the end of input or on a parse error
        t, err := decoder.Token()
        if err != nil {
            break
        }
        switch se := t.(type) {
        case xml.StartElement:
            level++
            switch se.Name.Local {
            case "fooTag":
                var req schema.FooRequest
                err = decoder.DecodeElement(&req, &se)
                // do stuff
            case "barRequest":
                var req schema.BarRequest
                err = decoder.DecodeElement(&req, &se)
                // do stuff
            }
        case xml.EndElement:
            level--
        }
    }

I could live with an implementation that ties the lifetime of the Reader and the deserialized object to the source lifetime, i.e. only applies to readers backed by a &str.

LiosK commented 1 month ago

By any chance is it possible to implement something like:

for Reader?

I would like to deserialize some specific <elem> ... </elem> ranges in a large document. To do this currently, I read events until the end tag, write them using Writer to a separate buffer, and then pass the buffer to quick_xml::de::from_str(). This is clearly inefficient, because it parses the XML twice and serializes it once as well. It would be great if Reader deserialized the elements when it first read the content up to the end tag.
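
The workaround described above can be sketched with the standard library alone, using a toy event type in place of a real pull parser. `Event`, `capture_element` and `serialize` are hypothetical names for illustration, not quick-xml items, and text is assumed to be already escaped.

```rust
/// A toy event type standing in for a pull parser's events.
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Start(String),
    End(String),
    Text(String),
}

/// Collect the events of one element, given its already-read `Start` event.
/// Returns `None` if the input ends before the matching `End` event.
fn capture_element<I>(start: Event, events: &mut I) -> Option<Vec<Event>>
where
    I: Iterator<Item = Event>,
{
    let mut captured = vec![start];
    let mut depth = 1usize; // the start tag we were given is still open
    while depth > 0 {
        let event = events.next()?;
        match &event {
            Event::Start(_) => depth += 1,
            Event::End(_) => depth -= 1,
            Event::Text(_) => {}
        }
        captured.push(event);
    }
    Some(captured)
}

/// Write the captured events back out as XML text. This is the extra
/// serialization step the workaround pays for before parsing again.
fn serialize(events: &[Event]) -> String {
    let mut out = String::new();
    for event in events {
        match event {
            Event::Start(name) => {
                out.push('<');
                out.push_str(name);
                out.push('>');
            }
            Event::End(name) => {
                out.push_str("</");
                out.push_str(name);
                out.push('>');
            }
            Event::Text(text) => out.push_str(text),
        }
    }
    out
}
```

The string returned by `serialize` would then be parsed a second time by the deserializer, which is the double parsing the comment objects to.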

Mingun commented 1 month ago

As I already explained, the serde deserializer requires lookahead, which Reader does not provide. The plan is:

LiosK commented 1 month ago

Cool!