tafia / quick-xml

Rust high performance xml reader and writer
MIT License

Text events are not emitted when text node ends exactly at BufferedReader buffer border #774

Closed randomdude999 closed 3 months ago

randomdude999 commented 3 months ago

Perhaps the title isn't quite clear here. As a concrete example, the document written by this Python script:

with open('test.xml', 'w') as f:
    f.write('<asdf>' + 'a' * (8192 - 6) + '</asdf>')

(i.e., a document with a single tag, whose contents end at offset 8192) will cause quick-xml to emit only a Start and an End event, with no Text event in between. This can be verified with the following Rust program, which simply dumps all received events to stderr:

use quick_xml::events::Event;
use quick_xml::reader::Reader;

fn main() {
    let mut reader = Reader::from_file("test.xml").unwrap();
    let mut buf = Vec::new();
    loop {
        let ev = reader.read_event_into(&mut buf);
        if matches!(ev, Ok(Event::Eof)) { break; }
        dbg!(&ev);
    }
}

This prints:

[src/main.rs:10:9] &ev = Ok(
    Start(
        BytesStart { buf: Borrowed("asdf"), name_len: 4 },
    ),
)
[src/main.rs:10:9] &ev = Ok(
    End(
        BytesEnd { name: Borrowed("asdf") },
    ),
)
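To illustrate why offset 8192 matters, here is a minimal sketch using only the standard library. It assumes the reader's buffer holds 8192 bytes (the usual default for std::io::BufReader, set explicitly here), and the helper names sample_document and chunk_lengths are made up for this example. It builds the same document in memory and records how many bytes each buffer refill delivers:

```rust
use std::io::{BufRead, BufReader, Cursor};

/// Builds the same bytes as the Python script: "<asdf>", then 'a' up to
/// offset 8192, then "</asdf>".
fn sample_document() -> Vec<u8> {
    let mut doc = b"<asdf>".to_vec();
    doc.resize(8192, b'a'); // pad with 'a' so the text ends exactly at offset 8192
    doc.extend_from_slice(b"</asdf>");
    doc
}

/// Returns the size of every chunk that a BufReader with the given
/// capacity hands out through fill_buf().
fn chunk_lengths(doc: Vec<u8>, capacity: usize) -> Vec<usize> {
    let mut reader = BufReader::with_capacity(capacity, Cursor::new(doc));
    let mut lens = Vec::new();
    loop {
        // Reading from an in-memory Cursor does not produce I/O errors.
        let len = reader.fill_buf().expect("in-memory read failed").len();
        if len == 0 {
            break; // end of input
        }
        lens.push(len);
        reader.consume(len);
    }
    lens
}
```

With a capacity of 8192, the first chunk ends on the last 'a' and the second chunk begins with the < of the closing tag, which is the situation described next.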

The root cause seems to be in the read_text function in buffered_reader.rs. In this case, reading the text data requires two loop iterations. On the first iteration, < is not found in the data, so the entire data is pushed onto buf and the loop continues. On the second iteration, < is found at position 0, which triggers the special case in the code that returns ReadTextResult::Markup instead of ReadTextResult::UpToMarkup, and so does not indicate that text data was also present.

I think the right fix here would be to only check the zero-position case on the first iteration of the loop.
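As a rough sketch of that idea (this is a simplified model, not the actual quick-xml source): the enum below reuses the variant names mentioned above but carries no data, the UpToEof case and the chunks slice standing in for successive buffer refills are simplifications for this example, and the first_iteration_only flag is hypothetical, added only so both behaviours can be compared side by side.

```rust
/// Simplified stand-in for the result type named in the issue.
#[derive(Debug, PartialEq)]
enum ReadTextResult {
    /// `<` was the very first byte seen, so no text precedes the markup.
    Markup,
    /// Text was collected into `buf` before `<` was reached.
    UpToMarkup,
    /// Input ended without any `<`.
    UpToEof,
}

/// Models the text-reading loop. Each element of `chunks` plays the role
/// of one buffer refill. With `first_iteration_only == false` the
/// zero-position special case fires on every iteration (the reported
/// behaviour); with `true` it fires only on the first one (the proposed fix).
fn read_text(
    chunks: &[&[u8]],
    buf: &mut Vec<u8>,
    first_iteration_only: bool,
) -> ReadTextResult {
    for (i, chunk) in chunks.iter().enumerate() {
        match chunk.iter().position(|&b| b == b'<') {
            // Special case: markup starts right at the beginning of the chunk.
            Some(0) if !first_iteration_only || i == 0 => return ReadTextResult::Markup,
            // Text (possibly empty in this chunk) precedes the markup.
            Some(pos) => {
                buf.extend_from_slice(&chunk[..pos]);
                return ReadTextResult::UpToMarkup;
            }
            // No markup yet: keep the whole chunk and continue.
            None => buf.extend_from_slice(chunk),
        }
    }
    ReadTextResult::UpToEof
}
```

Feeding this model the two chunks "aaaa" and "</asdf>" returns Markup when the flag is false, even though buf already holds the text, and UpToMarkup when the flag is true.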

Mingun commented 3 months ago

I just released 0.35.0 with this fix.