Closed by tniessen 4 months ago
Seems reasonable. Mind creating a PR? I plan to release 0.32 in the next few days, so this change can be included there. `error_position()` should also be changed, and don't forget to add a changelog entry.
@Mingun I'll give it a try! :)
While trying to approach this issue, I noticed a much bigger issue: all internal offsets are computed using `usize`, which breaks on 32-bit systems when processing more than 4 GiB of data in a single document.

Here's a very short reproduction of the issue, beginning with `main.rs` (with only `quick-xml` as a dependency):
```rust
fn main() {
    let file = std::fs::File::open("large.xml").unwrap();
    let mut reader = quick_xml::reader::Reader::from_reader(std::io::BufReader::new(file));
    let mut buf = Vec::new();
    loop {
        match reader.read_event_into(&mut buf).unwrap() {
            quick_xml::events::Event::Eof => break,
            _ => (),
        }
    }
    println!("Done.");
}
```
You can create such a file `large.xml`, for example, like this:

```shell
(echo -n '<root>' ; perl -e 'print "<child>text</child>"x300000000' ; echo -n '</root>') > large.xml
```
Then run the code in 32-bit mode:

```shell
$ rustup target add i686-unknown-linux-gnu
$ cargo run --release --target i686-unknown-linux-gnu
   Compiling test-quick-xml-4gb v0.1.0 (/home/tniessen/Documents/test-quick-xml-4gb)
    Finished `release` profile [optimized] target(s) in 0.27s
     Running `target/i686-unknown-linux-gnu/release/test-quick-xml-4gb`
thread 'main' panicked at library/alloc/src/raw_vec.rs:25:5:
capacity overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```
As you can see, and as expected from the internal use of `usize` for file offsets, parsing fails once the offset within the file exceeds the range of `usize`.

Since `quick_xml::reader::Reader<R: BufRead>` should not be limited to files that fit into memory, I believe that fixing these related issues would require changing all internal offset computations to use `u64`.
Looking at the same thing right now. `usize` is platform dependent, and I have read that it can be 128-bit on RISC-V, but so far, going by the Rust docs, `usize` cannot be larger than 64 bits. So I wonder if having a helper to simply convert all `usize` to `u64` internally would do the trick, i.e.:

```rust
fn to_u64(value: usize) -> u64 {
    u64::try_from(value).expect("failed to convert usize to u64")
}
```
Unless I missed something, it looks like we could get away with keeping everything memory related in `usize` but then converting to `u64` when storing `last_error_offset` and `offset`.
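A minimal sketch of that idea (the struct and field names here are illustrative, not the actual reader internals): buffer lengths stay in `usize`, and the running document offset is widened to `u64` at the point where it is stored.

```rust
// Hypothetical sketch: buffer handling stays in `usize`,
// only the running document offset is widened to `u64`.
struct OffsetTracker {
    /// Absolute position in the document; may exceed `usize::MAX`
    /// on 32-bit targets, so it is stored as `u64`.
    offset: u64,
}

impl OffsetTracker {
    fn new() -> Self {
        OffsetTracker { offset: 0 }
    }

    /// `consumed` is the length of a buffer slice, so it always
    /// fits in `usize`; the widening happens once, right here.
    fn advance(&mut self, consumed: usize) {
        // `usize` is at most 64 bits on supported targets,
        // so `as u64` is lossless.
        self.offset += consumed as u64;
    }
}

fn main() {
    let mut tracker = OffsetTracker::new();
    // Simulate consuming four ~4 GiB chunks; the running total no
    // longer fits in a 32-bit `usize`, but fits comfortably in u64.
    for _ in 0..4 {
        tracker.advance(u32::MAX as usize);
    }
    assert_eq!(tracker.offset, 4 * (u32::MAX as u64));
    println!("offset = {}", tracker.offset);
}
```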
Yes, internal offsets should be stored in `u64`. I'm not sure, is it possible for a `Vec` to store arrays whose length exceeds `usize`? If yes, then we should at least have special handling to break overly long `Text` / `CData` events into multiple ones. Actually, a tag name could also be that absurdly big, but not in real life. Such improvements, however, can be done in separate PRs, but we need to at least mention such limitations in the documentation.
I did the laborious work of prototyping in the u64-offsets branch.
> is it possible for a `Vec` to store arrays whose length exceeds `usize`?
If I'm understanding the question correctly, then the answer is no. A `Vec<T>` with capacity `cap` maintains a heap allocation of size `mem::size_of::<T>() * cap`, so if `cap` exceeded `usize`, this would be impossible.
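This limit can be observed directly with the standard library's `Vec::try_reserve`, which reports the failure instead of panicking (on any target, a `Vec<u8>` cannot reserve `usize::MAX` bytes, since Rust additionally caps single allocations at `isize::MAX` bytes):

```rust
fn main() {
    let mut v: Vec<u8> = Vec::new();
    // Requesting usize::MAX additional bytes exceeds the maximum
    // allocation size, so try_reserve returns an error rather than
    // aborting with "capacity overflow".
    let result = v.try_reserve(usize::MAX);
    assert!(result.is_err());
    println!("reserve failed as expected: {:?}", result);
}
```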
@Mingun cherry-pick fixes for tests from https://github.com/pronebird/quick-xml/commit/a393ffb726d02c3ed08746ca20c264a1f8a5dc5b
@pronebird @Mingun I apologize, my previous example did not properly clear the buffer between events. This is the correct reproduction:
```rust
fn main() {
    let file = std::fs::File::open("large.xml").unwrap();
    let mut reader = quick_xml::reader::Reader::from_reader(std::io::BufReader::new(file));
    let mut buf = Vec::new();
    loop {
        match reader.read_event_into(&mut buf).unwrap() {
            quick_xml::events::Event::Eof => break,
            _ => (),
        }
        buf.clear(); // This was missing above.
    }
    println!("Done.");
}
```
Now we see the expected panic:

```shell
$ cargo run --target i686-unknown-linux-gnu
   Compiling cfg-if v1.0.0
   Compiling memchr v2.7.4
   Compiling encoding_rs v0.8.34
   Compiling quick-xml v0.33.0
   Compiling test-quick-xml-4gb v0.1.0 (/home/tniessen/Documents/test-quick-xml-4gb)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 1.13s
     Running `target/i686-unknown-linux-gnu/debug/test-quick-xml-4gb`
thread 'main' panicked at /home/tniessen/.cargo/registry/src/index.crates.io-6f17d22bba15001f/quick-xml-0.33.0/src/reader/buffered_reader.rs:286:5:
attempt to add with overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```
@tniessen, could you test that with the u64-offsets branch?
@Mingun Neither debug nor release mode panics with that branch (using the minimal repro from my last comment).
Cool! It would be good to finish that work by thinking about how we can test it. Of course, having a 4 GB+ file in the repository is not an option :smile:. We could create a special `BufRead` implementation to stream data until its size exceeds `usize::MAX`. Or find a smarter way to test it (set internal attributes to a high value and stream some data?). Maybe I will find time in the near future to do that myself, but if someone would like to implement it, that would be great. How long did reading the 4 GB document take?
> We could create a special `BufRead` implementation to stream data until its size exceeds `usize::MAX`

You probably mean exceeds `u32::MAX`, because on 64-bit platforms `usize::MAX` would be the same as `u64::MAX`, which would result in the same panic.
A graceful solution would be to use checked arithmetic all over the place, catch overflows, and return an `Err` back, but I am not sure if anyone is realistically ever going to hit that limit, so I suppose there is no need to complicate the code more than necessary.
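For illustration, a minimal sketch of that checked-arithmetic idea (the error type and function here are made up for the example; quick-xml's actual error handling differs):

```rust
// Illustrative only: catching offset overflow with checked
// arithmetic instead of letting debug builds panic on `+`.
#[derive(Debug, PartialEq)]
enum OffsetError {
    Overflow,
}

/// Advances a `u64` document offset by a buffer length, reporting
/// overflow as an `Err` instead of panicking.
fn advance_offset(offset: u64, consumed: usize) -> Result<u64, OffsetError> {
    offset
        .checked_add(consumed as u64)
        .ok_or(OffsetError::Overflow)
}

fn main() {
    // Normal case: the offset advances as expected.
    assert_eq!(advance_offset(0, 1024), Ok(1024));
    // Pathological case: going past u64::MAX is reported, not a panic.
    assert_eq!(advance_offset(u64::MAX, 1), Err(OffsetError::Overflow));
    println!("checked offset arithmetic works");
}
```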
I made some checks and here are the results:

| Test | Time |
|---|---|
| With checks + small text | 438.37s (7m 18.37s) |
| With checks + long text | 15.97s |
| Without checks + small text | 411.09s (6m 51.09s) |
| Without checks + long text | 15.40s |
| Without checks + small text + `--release` | 28.29s |
| Without checks + long text + `--release` | 0.45s |
"Small text" is `<child>text</child>`; "long text" is the Lorem Ipsum with 100 paragraphs generated by https://ru.lipsum.com/feed/html, 61475 bytes long, in the following frame:

```xml
<content description="Text generated by https://ru.lipsum.com/feed/html: 100 paragraphs">
Lorem Ipsum...
</content>
```

"With checks" is the `assert_eq!` of the names returned in the `Start` and `End` events; "without checks" is without them.
The source of the XML (note: the destination slice in `read` must be truncated to `len`, since `copy_from_slice` panics on a length mismatch):

```rust
use std::io;

/// Endlessly replays `chunk` to simulate an arbitrarily large document.
struct Fountain<'a> {
    /// That piece of data repeated infinitely...
    chunk: &'a [u8],
    /// Part of `chunk` that was consumed by the `BufRead` impl
    consumed: usize,
    /// The overall count of read bytes
    overall_read: u64,
}

impl<'a> io::Read for Fountain<'a> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let available = &self.chunk[self.consumed..];
        let len = buf.len().min(available.len());
        let (portion, _) = available.split_at(len);
        // Copy only the matching prefix of `buf`.
        buf[..len].copy_from_slice(portion);
        Ok(len)
    }
}

impl<'a> io::BufRead for Fountain<'a> {
    fn fill_buf(&mut self) -> io::Result<&[u8]> {
        Ok(&self.chunk[self.consumed..])
    }

    fn consume(&mut self, amt: usize) {
        self.consumed += amt;
        // Wrap around to replay the chunk from the beginning.
        if self.consumed == self.chunk.len() {
            self.consumed = 0;
        }
        self.overall_read += amt as u64;
    }
}
```
The test ends when `overall_read >= u32::MAX`.

Because I ran the tests on x86_64, I did not run into any problems, so the only remaining question is how to test this on the correct target to hit the error. It should be possible to ensure that we have such a target on CI.
Thank you @Mingun and @pronebird for your work on this!
I also just released 0.34.0 with this fix.
`quick_xml::reader::Reader<R: BufRead>` allows reading arbitrarily large files without allocating much memory. I am using this in combination with `flate2::read::GzDecoder` in order to decompress and simultaneously parse very large XML files (several gigabytes per document) without having to store the decompressed file (either in memory or on disk).

Unfortunately, because offsets within the document are computed as `usize` internally, this causes `quick-xml` to panic when the document size exceeds 4 GiB on 32-bit targets.

Notably, `buffer_position()` returns `usize`, which is not necessarily sufficient for representing the correct position if the input document size exceeds 4 GiB.

Rust's built-in `std::io` types typically use `u64` for file sizes (and `i64` for relative offsets within files). Hence, I believe it would be appropriate to use `u64` for internal computations (and to return it from `buffer_position()` etc.) as well.