Open ndtoan96 opened 1 year ago
Having a deserialize method for Reader that would be able to deserialize a piece of XML into a type using serde from the current position is definitely a feature I also want, as a counterpart to #610. The implementation, however, is not so simple: a serde deserializer requires some (potentially unbounded) lookahead, so we need to buffer events somewhere.
The possible API could look something like this:
impl<'a> Reader<&'a [u8]> {
    fn deserialize<T>(&mut self, seed: Event<'a>) -> Result<T, DeError>
    where
        T: Deserialize<'a>,
    {}
}
impl<R: Read> Reader<R> {
    fn deserialize_into<'de, T>(&mut self, seed: Event<'de>, buffer: &'de mut Vec<u8>) -> Result<T, DeError>
    where
        T: Deserialize<'de>,
    {}
}
The seed here is an event that we got from Reader in a typical read cycle, which will likely be a part of the type that we want to deserialize.
Another possible API (very schematic):
impl<R> Reader<R> {
    fn deserializer(&mut self, seed: Event) -> FragmentDeserializer { ... }
}
struct FragmentDeserializer { ... }
impl FragmentDeserializer {
    fn deserialize<T>(self) -> Result<T, DeError>
    where
        T: Deserialize<'a>,
    {}
    fn deserialize_into<'de, T>(self, buffer: &'de mut Vec<u8>) -> Result<T, DeError>
    where
        T: Deserialize<'de>,
    {}
}
Another question: in what state should we leave Reader if deserialization fails? And how should we provide access to events that were consumed during lookahead but not used to deserialize the final type? If we want to call deserialize twice, the second call has to take into account the events that the first call read ahead. Probably we need a more generic API:
impl<R> Reader<R> {
    /// Convert to a reader that can store up to `count` events in the internal buffer
    fn lookahead(self, count: usize) -> LookaheadReader<R> { ... }
}
impl<'de, 'a, R> Deserializer<'de> for &'a mut LookaheadReader<R> { ... }
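To make the lookahead idea concrete, here is a minimal sketch of a bounded event buffer that sits in front of an arbitrary event source. It uses only the standard library. `Ev`, `Lookahead`, `peek_nth` and `next_event` are hypothetical names invented for illustration, not part of quick-xml, and the real `Event` type is richer than this stand-in.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for an XML event (assumption: not the crate's `Event`).
#[derive(Debug, Clone, PartialEq)]
pub enum Ev {
    Start(String),
    Text(String),
    End(String),
}

/// Wraps any event source and keeps at most `cap` events buffered for lookahead.
pub struct Lookahead<I: Iterator<Item = Ev>> {
    source: I,
    buffered: VecDeque<Ev>,
    cap: usize,
}

impl<I: Iterator<Item = Ev>> Lookahead<I> {
    pub fn new(source: I, cap: usize) -> Self {
        Self { source, buffered: VecDeque::new(), cap }
    }

    /// Look at the event `n` positions ahead without consuming anything.
    /// Returns `None` if `n` is outside the lookahead capacity or the
    /// source ends before that position.
    pub fn peek_nth(&mut self, n: usize) -> Option<&Ev> {
        if n >= self.cap {
            return None;
        }
        while self.buffered.len() <= n {
            let ev = self.source.next()?;
            self.buffered.push_back(ev);
        }
        self.buffered.get(n)
    }

    /// Consume the next event, draining buffered events before the source.
    pub fn next_event(&mut self) -> Option<Ev> {
        self.buffered.pop_front().or_else(|| self.source.next())
    }
}
```

Because peeked events stay in the buffer until they are consumed, a second consumer (for example a second deserialize call) still sees everything the first one only looked at. What to do when the capacity is too small for a given type is left open here, as it is in the discussion.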
By the way, is there any reason why read_text is not implemented for Reader<File>?
It is not trivial to do that, because we cannot just reuse the read_to_end_into method: it stores only the content of the tags in the buffer and skips the markup characters (<, > and so on). The attempts to implement it are tracked in #483.
I would also like this. Go makes it easy to mix pull-based parsing with a state machine and deserializing structs:
decoder := xml.NewDecoder(r.Body)
decoder.Strict = true
for {
	// Pull the next token; this call is missing from the snippet as posted.
	t, err := decoder.Token()
	if err != nil {
		break // io.EOF or a parse error
	}
	switch se := t.(type) {
	case xml.StartElement:
		level++
		switch se.Name.Local {
		case "fooTag":
			var req schema.FooRequest
			decoder.DecodeElement(&req, &se)
			// do stuff
		case "barRequest":
			var req schema.BarRequest
			err = decoder.DecodeElement(&req, &se)
			// do stuff
		}
	case xml.EndElement:
		level--
	}
}
I could live with an implementation that ties the lifetime of the Reader and the deserialized object to the source lifetime, i.e. only applies to readers backed by a &str.
By any chance is it possible to implement something like:
fn deserialize_to_end(&'de mut self, end: QName<'_>) -> Result<T<'de>, E>
fn deserialize_to_end_into(&mut self, end: QName<'_>, buf: &'de mut Vec<u8>) -> Result<T<'de>, E>
for Reader?
I would like to deserialize some specific <elem> ... </elem> ranges in a large document. To do this currently, I read events until the end tag, write them using Writer to a separate buffer, and then pass the buffer to quick_xml::de::from_str(). This is clearly inefficient, because it parses the XML twice and serializes it once as well. It would be great if Reader deserialized the elements when it first read the content up to the end tag.
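Until something like that exists, one way to drop the Writer round trip for in-memory input is to find the byte range of the element and hand that slice of the original string to quick_xml::de::from_str(). The sketch below is a deliberately naive, standard-library-only illustration of finding such a range. `element_span` is a hypothetical helper, not a quick-xml API, and it assumes the input has no comments, CDATA sections or processing instructions containing tag-like text, and no `>` inside attribute values.

```rust
use std::ops::Range;

/// Returns the byte range of the first `<name ...> ... </name>` element in
/// `src`, including both tags, or `None` if no balanced element is found.
///
/// Naive by design: see the assumptions listed in the text above.
pub fn element_span(src: &str, name: &str) -> Option<Range<usize>> {
    let mut pos = 0;
    let mut depth = 0usize; // nesting depth of same-named elements
    let mut start: Option<usize> = None;

    while let Some(rel) = src[pos..].find('<') {
        let open = pos + rel;
        let close = open + src[open..].find('>')?;
        let tag = &src[open + 1..close];
        pos = close + 1;

        if let Some(end_name) = tag.strip_prefix('/') {
            // Closing tag: only counts once the target element has started.
            if start.is_some() && end_name.trim() == name {
                depth -= 1;
                if depth == 0 {
                    return start.map(|s| s..pos);
                }
            }
        } else {
            let self_closing = tag.ends_with('/');
            let tag_name = tag
                .trim_end_matches('/')
                .split_whitespace()
                .next()
                .unwrap_or("");
            if tag_name == name {
                if start.is_none() {
                    if self_closing {
                        return Some(open..pos);
                    }
                    start = Some(open);
                }
                if !self_closing {
                    depth += 1;
                }
            }
        }
    }
    None
}
```

The resulting `&src[span]` is a borrowed slice, so nothing is copied or re-serialized. The fragment is still scanned twice (once here, once by the deserializer), so this removes only part of the overhead described above.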
As I already explained, a serde deserializer requires lookahead, which Reader does not provide. The plan is:
- [...] Reader to RawReader. Each RawReader will handle one XML source (called an entity in the XML spec).
- [...] Reader with the stack of RawReaders. That new Reader will be able to handle DTD references to other entities. Because of that it will naturally have storage inside (no *_into methods anymore). That also means it can store cached events.
- [...] Deserializer [...] Reader
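As a rough illustration of the "stack of RawReaders" idea, the sketch below models each source as a queue of tokens and pushes a new source onto a stack whenever a reference to another entity is encountered. Everything here (`Token`, `RawSource`, `StackedReader`, `next_content`) is hypothetical and standard-library-only. A real implementation would parse bytes, and would need to guard against recursive entity references, which this sketch does not.

```rust
use std::collections::HashMap;

/// Hypothetical token from a single source: plain content, or a reference
/// to another entity whose content should be spliced in at this point.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Content(String),
    EntityRef(String),
}

/// One source (assumption: pre-tokenized; a real one would read bytes).
pub struct RawSource {
    tokens: std::vec::IntoIter<Token>,
}

/// Reader that owns a stack of sources and a table of known entities.
pub struct StackedReader {
    stack: Vec<RawSource>,
    entities: HashMap<String, Vec<Token>>,
}

impl StackedReader {
    pub fn new(root: Vec<Token>, entities: HashMap<String, Vec<Token>>) -> Self {
        let root = RawSource { tokens: root.into_iter() };
        Self { stack: vec![root], entities }
    }

    /// Returns the next piece of content, descending into referenced
    /// entities and returning to the outer source when one is exhausted.
    pub fn next_content(&mut self) -> Option<String> {
        loop {
            let top = self.stack.last_mut()?;
            match top.tokens.next() {
                // Current source is exhausted: resume the one below it.
                None => {
                    self.stack.pop();
                }
                Some(Token::Content(text)) => return Some(text),
                Some(Token::EntityRef(name)) => {
                    // Unknown entities are silently skipped in this sketch.
                    if let Some(tokens) = self.entities.get(&name) {
                        let tokens = tokens.clone().into_iter();
                        self.stack.push(RawSource { tokens });
                    }
                }
            }
        }
    }
}
```

Since the reader owns its sources, it is also a natural place to keep buffered events, which is the property the plan relies on for lookahead.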
Cool!
When working with deeply nested XML, most of the time we are only interested in a portion of the whole tree close to a leaf node. My idea is to extract the string of the target node and deserialize it with serde, but I can't find any convenient way to do that.
Currently I use read_text to get the inner content of the node and add the start and end tags manually, but then the code looks really weird, especially when the node has many attributes. It would be great if there were a method (read_node or something) to do that.
By the way, is there any reason why read_text is not implemented for Reader<File>?