**ajtribick** opened this issue 3 years ago
I thought about it and here's what I suggest: introduce two new events, `Event::TextChunk(BytesText)` and `Event::CDataChunk(BytesCData)`. In borrowed mode (the `read_event` method) all the XML is already in memory, so there is no reason to split text / CDATA into chunks. In buffered mode, because of `BufRead::fill_buf()`, each chunk will contain as much data as the `BufRead` implementation returns. When the end of the text / CDATA section is reached (`<` or `]]>`), `Event::Text` / `Event::CData` will be emitted.

That sounds reasonable.
One point: returning `Event::TextChunk(BytesText)` would require maintaining a little extra state in the parser so that it remembers that it is still in the middle of parsing an `Event::Text`. Therefore, if the parser already has to go into a special mode, we could theoretically just return an `Event::TextHandle` / `Event::CDataHandle` (note: no payload on those) immediately, after which we could let the user grab a handle from the reader that would allow them to `Read` the contents of the `Event::Text` / `Event::CData` directly, up until the start of the next event.
That's a little closer to what the user suggested and might be more convenient. There are downsides to both approaches though.
The handle-based approach could be difficult to implement. In theory we can return a handle here instead of the `match`:

https://github.com/tafia/quick-xml/blob/8d38e4cc145b1911d67690a9e076c056ff2e6efe/src/reader/mod.rs#L282-L292

but then, in order to continue parsing, we would have to guarantee that the text behind the handle is consumed. We could implement `Drop` for the handle to do that, but then we would need to suppress all errors that could occur during the read. The implementation with special events, on the other hand, seems easy to implement. A `Read` stream could also be built on top of it:
```rust
impl<R> Reader<R> {
    pub fn text_stream<'r, 'b>(
        &'r mut self,
        first_chunk: impl Into<BytesText<'b>>,
    ) -> TextStream<'r, 'b, R> {
        TextStream {
            reader: self,
            chunk: first_chunk.into(), // a CDataChunk could also be passed as the first chunk
        }
    }
}

struct TextStream<'r, 'b, R> {
    reader: &'r mut Reader<R>,
    chunk: BytesText<'b>,
}

impl<'r, 'b, R: BufRead> Read for TextStream<'r, 'b, R> { /* ... */ }
```
but that is not as simple as it looks (I tried), and my current implementation tends to contain many unnecessary copies. I think it would be possible to implement such an interface, and even efficiently, but a major rewrite of the parser could be required (in particular, it seems we would need to read from `Read` instead of `BufRead`). If the `TextStream` is not consumed, the `Reader` will continue to return events as usual.
I have an alternate suggestion. Instead of adding a distinct API, the reader could return `Event::Text` multiple times, depending on the saturation of the underlying buffer (so probably only affecting `BufReader`, as already suggested). Something like this would do the trick:
```diff
diff --git a/src/reader/buffered_reader.rs b/src/reader/buffered_reader.rs
index 0136a55..7aa343a 100644
--- a/src/reader/buffered_reader.rs
+++ b/src/reader/buffered_reader.rs
@@ -93,6 +93,9 @@ macro_rules! impl_buffered_source {
                     let used = available.len();
                     self $(.$reader)? .consume(used);
                     read += used as u64;
+                    *position += read;
+
+                    return ReadTextResult::Chunk(&buf[start..]);
                 }
             }
         }
diff --git a/src/reader/mod.rs b/src/reader/mod.rs
index a05e5bc..08dfdab 100644
--- a/src/reader/mod.rs
+++ b/src/reader/mod.rs
@@ -257,6 +257,10 @@ macro_rules! read_event_impl {
             }
             match $reader.read_text($buf, &mut $self.state.offset) $(.$await)? {
+                ReadTextResult::Chunk(bytes) => {
+                    $self.state.state = ParseState::InsideText;
+                    Ok(Event::Text($self.state.emit_text(bytes)))
+                }
                 ReadTextResult::Markup(buf) => {
                     $self.state.state = ParseState::InsideMarkup;
                     // Pass `buf` to the next iteration of the parsing loop
@@ -921,6 +925,7 @@ enum ReadTextResult<'r, B> {
     Markup(B),
     /// Contains text block up to start of markup (`<` character).
     UpToMarkup(&'r [u8]),
+    Chunk(&'r [u8]),
     /// Contains text block up to EOF, start of markup (`<` character) was not found.
     UpToEof(&'r [u8]),
     /// IO error occurred.
```
But maybe I am missing something.
@arkkors, that was my original design in https://github.com/tafia/quick-xml/issues/260#issuecomment-2159779818 and I think it is worth implementing.
@Mingun I'm not exactly sure. Do you suggest that the `Event::Text` semantics be preserved, i.e. that `Event::Text` yields the full text content, or not? That would require growing the provided `buf` to the size of the text, which I would like to avoid. Something like this instead:
```diff
diff --git a/src/reader/buffered_reader.rs b/src/reader/buffered_reader.rs
index 0136a55..780126b 100644
--- a/src/reader/buffered_reader.rs
+++ b/src/reader/buffered_reader.rs
@@ -78,21 +78,31 @@ macro_rules! impl_buffered_source {
                     return ReadTextResult::Markup(buf);
                 }
                 Some(i) => {
-                    buf.extend_from_slice(&available[..i]);
+                    if start < i {
+                        buf.resize(i + 1, 0x00);
+                    }
+                    buf[..i].copy_from_slice(&available[..i]);
                     let used = i + 1;
                     self $(.$reader)? .consume(used);
                     read += used as u64;
                     *position += read;
-                    return ReadTextResult::UpToMarkup(&buf[start..]);
+                    return ReadTextResult::UpToMarkup(&buf[..i]);
                 }
                 None => {
-                    buf.extend_from_slice(available);
-
                     let used = available.len();
+
+                    if start < used {
+                        buf.resize(used + 1, 0x00);
+                    }
+                    buf[..used].copy_from_slice(available);
+
                     self $(.$reader)? .consume(used);
                     read += used as u64;
+                    *position += read;
+
+                    return ReadTextResult::Chunk(&buf[..used]);
                 }
             }
         }
```
Something like this (except much more cleanly implemented ;)). After reading through the code, I don't think that is easily possible with the current `Event` struct, or is it?
No, in my suggestion `Event::Text` will return the last chunk of data; intermediate chunks would be returned in `Event::TextChunk` events. So in the end you'll get `TextChunk`, `TextChunk`, ..., `TextChunk`, `Text` events. If you are not interested in distinguishing between intermediate and final events, you can simply handle them equally.

It is definitely not worth expanding the buffer inside the `Reader`.
In some cases it may be useful to read the content of a text node via the `Read` trait. An example might be if the node contained a large amount of base64-encoded binary data, which the application parses into its own data structures: needing to go via a byte array holding the textual content of the node would needlessly allocate memory.

It would be useful to have a method on `Reader` that returns an implementation of `Read` that yields the unescaped/decoded content until the specified end tag.

In the case that the application does not read to the end of the returned `Read`, I think it makes sense for the `Reader` to skip ahead to where it would be if the `Read` had been completed.