tafia / quick-xml

High-performance XML reader and writer in Rust
MIT License

Allow streaming read of text content #260

Open ajtribick opened 3 years ago

ajtribick commented 3 years ago

In some cases it may be useful to read the content of a text node via the Read trait. An example might be if the node contained a large amount of base64-encoded binary data, which the application parses into its own data structures: needing to go via a byte array holding the textual content of the node would needlessly allocate memory.

It would be useful to have a method on Reader that returns an implementation of Read that returns the unescaped/decoded content until the specified end tag.

In the case that the application does not read to the end of the returned Read, I think it makes sense for the Reader to skip ahead to where it would be if the Read had been completed.
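To make the requested shape concrete, here is a minimal std-only sketch of the kind of adapter the issue asks for: a type implementing `Read` that yields a text node's bytes straight from the underlying buffer, stopping at the next markup character. `TextContentReader` is a hypothetical name, and real code would also need to unescape entities and track the end tag; this only illustrates the streaming idea.

```rust
use std::io::{BufRead, Read};

/// Hypothetical sketch: a `Read` adapter that yields the bytes of a text node
/// until the next markup character (`<`), without first collecting them into
/// an intermediate allocation.
struct TextContentReader<'a, R: BufRead> {
    inner: &'a mut R,
    done: bool,
}

impl<'a, R: BufRead> Read for TextContentReader<'a, R> {
    fn read(&mut self, out: &mut [u8]) -> std::io::Result<usize> {
        if self.done || out.is_empty() {
            return Ok(0);
        }
        let available = self.inner.fill_buf()?;
        if available.is_empty() {
            self.done = true;
            return Ok(0);
        }
        // Everything before the start of the next markup is text content.
        let end = available
            .iter()
            .position(|&b| b == b'<')
            .unwrap_or(available.len());
        if end == 0 {
            self.done = true;
            return Ok(0);
        }
        let n = end.min(out.len());
        out[..n].copy_from_slice(&available[..n]);
        self.inner.consume(n);
        Ok(n)
    }
}

fn main() -> std::io::Result<()> {
    // Large base64 payloads could be fed to a streaming decoder this way
    // instead of being buffered into one big byte array first.
    let mut cursor = std::io::Cursor::new(b"SGVsbG8=</data>".to_vec());
    let mut stream = TextContentReader { inner: &mut cursor, done: false };
    let mut text = String::new();
    stream.read_to_string(&mut text)?;
    assert_eq!(text, "SGVsbG8=");
    Ok(())
}
```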

Mingun commented 5 months ago

I thought about it and here's what I suggest:

dralley commented 5 months ago

That sounds reasonable

One point: returning Event::TextChunk(BytesText) would require the parser to maintain a little extra state so that it remembers it is still in the middle of parsing an Event::Text. Since the parser already has to enter a special mode either way, we could theoretically return an Event::TextHandle / Event::CDataHandle (note: no payload on those) immediately, after which the user could grab a handle from the reader that lets them Read the contents of the Event::Text / Event::CData directly, up until the start of the next event.

That's a little closer to what the user suggested and might be more convenient. There are downsides to both approaches though.

Mingun commented 5 months ago

A handle-based implementation could be difficult. In theory we could return a handle here instead of the match: https://github.com/tafia/quick-xml/blob/8d38e4cc145b1911d67690a9e076c056ff2e6efe/src/reader/mod.rs#L282-L292

but then, in order to continue parsing, we would need to guarantee that the text from the handle is fully consumed. We could implement Drop for the handle to do that, but then we would have to suppress any errors that occur during that read. An implementation with special events, on the other hand, seems easy. A Read stream could also be implemented on top of it:

impl<R: BufRead> Reader<R> {
  pub fn text_stream<'r, 'b>(&'r mut self, first_chunk: impl Into<BytesText<'b>>) -> TextStream<'r, 'b, R> {
    TextStream {
      reader: self,
      chunk: first_chunk.into(), // a CData chunk could also be passed as the first chunk
    }
  }
}
struct TextStream<'r, 'b, R> {
  reader: &'r mut Reader<R>,
  chunk: BytesText<'b>,
}
impl<'r, 'b, R: BufRead> Read for TextStream<'r, 'b, R> { /* ... */ }

but that is not as simple as it looks (I tried), and currently my implementation tends to contain many unnecessary copies. I think it would be possible to implement such an interface, and even efficiently, but some major rewrite of the parser might be required (in particular, it seems we would need to read from Read instead of BufRead). If the TextStream is not consumed, the Reader will continue to return events as usual.
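The objection to a Drop-based handle can be illustrated with a std-only sketch (the type name `DrainOnDrop` is hypothetical): on drop, the handle must drain whatever the caller did not read so the parser position stays consistent, and because `Drop::drop` cannot return a `Result`, any I/O error during that implicit drain has to be swallowed.

```rust
use std::io::Read;

/// Hypothetical guard illustrating the handle-based design: on drop, drain
/// whatever the caller did not read so the parser can continue from a
/// consistent position.
struct DrainOnDrop<R: Read> {
    inner: R,
}

impl<R: Read> Read for DrainOnDrop<R> {
    fn read(&mut self, out: &mut [u8]) -> std::io::Result<usize> {
        self.inner.read(out)
    }
}

impl<R: Read> Drop for DrainOnDrop<R> {
    fn drop(&mut self) {
        // Errors here are silently discarded -- the core objection to the
        // handle-based design, since `drop` cannot report failures.
        let _ = std::io::copy(&mut self.inner, &mut std::io::sink());
    }
}

fn main() {
    let data = std::io::Cursor::new(vec![1u8; 16]);
    let mut handle = DrainOnDrop { inner: data };
    let mut first = [0u8; 4];
    handle.read_exact(&mut first).unwrap(); // the caller reads only part...
    assert_eq!(first, [1, 1, 1, 1]);
    drop(handle); // ...the rest is consumed implicitly, errors ignored
}
```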

arkkors commented 6 days ago

I have an alternate suggestion. Instead of adding a distinct API, the reader could return Event::Text multiple times, depending on how full the underlying buffer is (probably only affecting BufReader, as already suggested).

Something like this would do the trick:

diff --git a/src/reader/buffered_reader.rs b/src/reader/buffered_reader.rs
index 0136a55..7aa343a 100644
--- a/src/reader/buffered_reader.rs
+++ b/src/reader/buffered_reader.rs
@@ -93,6 +93,9 @@ macro_rules! impl_buffered_source {
                         let used = available.len();
                         self $(.$reader)? .consume(used);
                         read += used as u64;
+                        *position += read;
+
+                        return ReadTextResult::Chunk(&buf[start..]);
                     }
                 }
             }
diff --git a/src/reader/mod.rs b/src/reader/mod.rs
index a05e5bc..08dfdab 100644
--- a/src/reader/mod.rs
+++ b/src/reader/mod.rs
@@ -257,6 +257,10 @@ macro_rules! read_event_impl {
                     }

                     match $reader.read_text($buf, &mut $self.state.offset) $(.$await)? {
+                        ReadTextResult::Chunk(bytes) => {
+                            $self.state.state = ParseState::InsideText;
+                            Ok(Event::Text($self.state.emit_text(bytes)))
+                        }
                         ReadTextResult::Markup(buf) => {
                             $self.state.state = ParseState::InsideMarkup;
                             // Pass `buf` to the next next iteration of parsing loop
@@ -921,6 +925,7 @@ enum ReadTextResult<'r, B> {
     Markup(B),
     /// Contains text block up to start of markup (`<` character).
     UpToMarkup(&'r [u8]),
+    Chunk(&'r [u8]),
     /// Contains text block up to EOF, start of markup (`<` character) was not found.
     UpToEof(&'r [u8]),
     /// IO error occurred.

But maybe I am missing something.

Mingun commented 6 days ago

@arkkors, that was my original design in https://github.com/tafia/quick-xml/issues/260#issuecomment-2159779818 and I think it is worth implementing.

arkkors commented 6 days ago

@Mingun I'm not exactly sure. Do you suggest that the Event::Text semantics be preserved, i.e. that Event::Text yields the full text content? That would require growing the provided buf to the size of the text, which I would like to avoid. Something like this:

diff --git a/src/reader/buffered_reader.rs b/src/reader/buffered_reader.rs
index 0136a55..780126b 100644
--- a/src/reader/buffered_reader.rs
+++ b/src/reader/buffered_reader.rs
@@ -78,21 +78,31 @@ macro_rules! impl_buffered_source {
                         return ReadTextResult::Markup(buf);
                     }
                     Some(i) => {
-                        buf.extend_from_slice(&available[..i]);
+                        if start < i {
+                            buf.resize(i + 1, 0x00);
+                        }
+                        buf[..i].copy_from_slice(&available[..i]);

                         let used = i + 1;
                         self $(.$reader)? .consume(used);
                         read += used as u64;

                         *position += read;
-                        return ReadTextResult::UpToMarkup(&buf[start..]);
+                        return ReadTextResult::UpToMarkup(&buf[..i]);
                     }
                     None => {
-                        buf.extend_from_slice(available);
-
                         let used = available.len();
+
+                        if start < used {
+                            buf.resize(used + 1, 0x00);
+                        }
+                        buf[..used].copy_from_slice(available);
+
                         self $(.$reader)? .consume(used);
                         read += used as u64;
+                        *position += read;
+
+                        return ReadTextResult::Chunk(&buf[..used]);
                     }
                 }
             }

Something like this (except much more cleanly implemented ;)). After reading through the code, I don't think that is easily possible with the current Event struct, or is it?

Mingun commented 6 days ago

No. In my suggestion, Event::Text will return the last chunk of data; intermediate chunks would be returned in Event::TextChunk events. So in the end you'll get TextChunk, TextChunk, ..., TextChunk, Text events. If you are not interested in distinguishing between intermediate and final events, you can simply handle them all the same way.

It is definitely not worth expanding the buffer inside the Reader.
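From the consumer's side, the proposed chunked sequence could be handled with a single match arm covering both variants. The enum below is a mock of the proposed event shapes, not the real quick-xml Event enum (TextChunk is only a proposal at this point); payloads are simplified to Vec<u8>.

```rust
/// Mock of the proposed event shapes (not the real quick-xml `Event` enum):
/// intermediate chunks arrive as `TextChunk`, the final one as `Text`.
enum Event {
    TextChunk(Vec<u8>),
    Text(Vec<u8>),
    Eof,
}

/// Accumulate a text node that may arrive split across several events.
fn collect_text(events: impl IntoIterator<Item = Event>) -> Vec<u8> {
    let mut text = Vec::new();
    for event in events {
        match event {
            // Treat intermediate and final chunks equally, as suggested above.
            Event::TextChunk(bytes) | Event::Text(bytes) => {
                text.extend_from_slice(&bytes)
            }
            Event::Eof => break,
        }
    }
    text
}

fn main() {
    let events = vec![
        Event::TextChunk(b"hel".to_vec()),
        Event::TextChunk(b"lo ".to_vec()),
        Event::Text(b"world".to_vec()),
        Event::Eof,
    ];
    assert_eq!(collect_text(events), b"hello world");
}
```

A streaming consumer (e.g. a base64 decoder) would instead feed each chunk to its decoder as it arrives, never holding the full text in memory.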