Closed by tniessen 4 months ago
Seems reasonable. Mind creating a PR? I plan to release 0.32 in the next few days, so this change can be included there. `error_position()` should also be changed, and don't forget to add a changelog entry.
@Mingun I'll give it a try! :)
While trying to approach this issue, I noticed a much bigger issue: all internal offsets are computed using `usize`, which breaks on 32-bit systems when processing more than 4 GiB of data in a single document.

Here's a very short reproduction of the issue, beginning with `main.rs` (with only `quick-xml` as a dependency):
```rust
fn main() {
    let file = std::fs::File::open("large.xml").unwrap();
    let mut reader = quick_xml::reader::Reader::from_reader(std::io::BufReader::new(file));
    let mut buf = Vec::new();
    loop {
        match reader.read_event_into(&mut buf).unwrap() {
            quick_xml::events::Event::Eof => break,
            _ => (),
        }
    }
    println!("Done.");
}
```
You can create such a file `large.xml`, for example, like this:

```shell
(echo -n '<root>' ; perl -e 'print "<child>text</child>"x300000000' ; echo -n '</root>') > large.xml
```
Then run the code in 32-bit mode:

```shell
$ rustup target add i686-unknown-linux-gnu
$ cargo run --release --target i686-unknown-linux-gnu
   Compiling test-quick-xml-4gb v0.1.0 (/home/tniessen/Documents/test-quick-xml-4gb)
    Finished `release` profile [optimized] target(s) in 0.27s
     Running `target/i686-unknown-linux-gnu/release/test-quick-xml-4gb`
thread 'main' panicked at library/alloc/src/raw_vec.rs:25:5:
capacity overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```
As you can see, and as expected from the internal use of `usize` for file offsets, parsing fails once the offset within the file exceeds the range of `usize`.

Since `quick_xml::reader::Reader<R: BufRead>` should not be limited to files that fit into memory, I believe that fixing these related issues would require changing all internal offset computations to use `u64`.
Looking at the same thing right now. `usize` is platform dependent, and I have read that it can be 128-bit on RISC-V, but so far, going by the Rust docs, `usize` cannot be larger than 64 bits. So I wonder if having a helper to simply convert all `usize` to `u64` internally would do the trick, i.e.:

```rust
fn to_u64(value: usize) -> u64 {
    u64::try_from(value).expect("failed to convert usize to u64")
}
```
Unless I missed something, it looks like we could get away with keeping everything memory related in `usize` but then converting to `u64` when storing `last_error_offset` and `offset`.
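A minimal sketch of that idea (the struct and field names here are illustrative, not the actual reader internals): buffer lengths stay in `usize`, and the running document offset is widened to `u64` at the point where it is stored.

```rust
// Hypothetical sketch: buffer handling stays in `usize`,
// only the running document offset is widened to `u64`.
struct OffsetTracker {
    /// Absolute position in the document; may exceed `usize::MAX`
    /// on 32-bit targets, so it is stored as `u64`.
    offset: u64,
}

impl OffsetTracker {
    fn new() -> Self {
        OffsetTracker { offset: 0 }
    }

    /// `consumed` is the length of a buffer slice, so it always
    /// fits in `usize`; the widening happens once, right here.
    fn advance(&mut self, consumed: usize) {
        // `usize` is at most 64 bits on supported targets,
        // so `as u64` is lossless.
        self.offset += consumed as u64;
    }
}

fn main() {
    let mut tracker = OffsetTracker::new();
    // Simulate consuming four ~4 GiB chunks; the running total no
    // longer fits in a 32-bit `usize`, but fits comfortably in u64.
    for _ in 0..4 {
        tracker.advance(u32::MAX as usize);
    }
    assert_eq!(tracker.offset, 4 * (u32::MAX as u64));
    println!("offset = {}", tracker.offset);
}
```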
Yes, internal offsets should be stored in `u64`. I'm not sure, is it possible for a `Vec` to store arrays whose length exceeds `usize`? If yes, then we should at least have special handling to break overly long `Text` / `CData` events into multiple ones. Actually, a tag name could also be that absurdly big, but not in real life. Such improvements, however, can be done in separate PRs, but we need to at least mention such limitations in the documentation.
I did the laborious work of prototyping in the u64-offsets branch.
> is it possible for a `Vec` to store arrays whose length exceeds `usize`?
If I'm understanding the question correctly, then the answer is no. A `Vec<T>` with capacity `cap` maintains a heap allocation of size `mem::size_of::<T>() * cap`, so if `cap` exceeded `usize`, this would be impossible.
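This limit can be observed directly with the standard library's `Vec::try_reserve`, which reports the failure instead of panicking (on any target, a `Vec<u8>` cannot reserve `usize::MAX` bytes, since Rust additionally caps single allocations at `isize::MAX` bytes):

```rust
fn main() {
    let mut v: Vec<u8> = Vec::new();
    // Requesting usize::MAX additional bytes exceeds the maximum
    // allocation size, so try_reserve returns an error rather than
    // aborting with "capacity overflow".
    let result = v.try_reserve(usize::MAX);
    assert!(result.is_err());
    println!("reserve failed as expected: {:?}", result);
}
```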
@Mingun cherry-pick fixes for tests from https://github.com/pronebird/quick-xml/commit/a393ffb726d02c3ed08746ca20c264a1f8a5dc5b
@pronebird @Mingun I apologize, my previous example did not properly clear the buffer between events. This is the correct reproduction:
```rust
fn main() {
    let file = std::fs::File::open("large.xml").unwrap();
    let mut reader = quick_xml::reader::Reader::from_reader(std::io::BufReader::new(file));
    let mut buf = Vec::new();
    loop {
        match reader.read_event_into(&mut buf).unwrap() {
            quick_xml::events::Event::Eof => break,
            _ => (),
        }
        buf.clear(); // This was missing above.
    }
    println!("Done.");
}
```
Now we see the expected panic:

```shell
$ cargo run --target i686-unknown-linux-gnu
   Compiling cfg-if v1.0.0
   Compiling memchr v2.7.4
   Compiling encoding_rs v0.8.34
   Compiling quick-xml v0.33.0
   Compiling test-quick-xml-4gb v0.1.0 (/home/tniessen/Documents/test-quick-xml-4gb)
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 1.13s
     Running `target/i686-unknown-linux-gnu/debug/test-quick-xml-4gb`
thread 'main' panicked at /home/tniessen/.cargo/registry/src/index.crates.io-6f17d22bba15001f/quick-xml-0.33.0/src/reader/buffered_reader.rs:286:5:
attempt to add with overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```
@tniessen, could you test that with the u64-offsets branch?
@Mingun Neither debug nor release mode panics with that branch (using the minimal repro from my last comment).
Cool! It would be good to finish that work by thinking about how we can test it. Of course, having a 4 GB+ file in the repository is not an option :smile:. We could create a special `BufRead` implementation to stream data until its size exceeds `usize::MAX`. Or find a smarter way to test it (set internal attributes to a high value and stream some data?). Maybe I will find time in the near future to do that myself, but if someone would like to implement it, that would be great. How long did reading the 4 GB document take?
> We could create a special `BufRead` implementation to stream data until its size exceeds `usize::MAX`

You probably mean exceeds `u32::MAX`, because on 64-bit platforms `usize::MAX` would be the same as `u64::MAX`, which would result in the same panic.
A graceful solution would be to use checked arithmetic all over the place, catch overflows, and return an `Err` back, but I am not sure if anyone is realistically ever going to hit that limit, so I suppose there is no need to complicate the code more than necessary.
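For illustration, a minimal sketch of that checked-arithmetic idea (the error type and function here are made up for the example; quick-xml's actual error handling differs):

```rust
// Illustrative only: catching offset overflow with checked
// arithmetic instead of letting debug builds panic on `+`.
#[derive(Debug, PartialEq)]
enum OffsetError {
    Overflow,
}

/// Advances a `u64` document offset by a buffer length, reporting
/// overflow as an `Err` instead of panicking.
fn advance_offset(offset: u64, consumed: usize) -> Result<u64, OffsetError> {
    offset
        .checked_add(consumed as u64)
        .ok_or(OffsetError::Overflow)
}

fn main() {
    // Normal case: the offset advances as expected.
    assert_eq!(advance_offset(0, 1024), Ok(1024));
    // Pathological case: going past u64::MAX is reported, not a panic.
    assert_eq!(advance_offset(u64::MAX, 1), Err(OffsetError::Overflow));
    println!("checked offset arithmetic works");
}
```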
I made some checks and here are the results:

| Test | Time |
|---|---|
| With checks + small text | 438.37s (7m 18.37s) |
| With checks + long text | 15.97s |
| Without checks + small text | 411.09s (6m 51.09s) |
| Without checks + long text | 15.40s |
| Without checks + small text + `--release` | 28.29s |
| Without checks + long text + `--release` | 0.45s |
"Small text" is `<child>text</child>`; "long text" is the Lorem Ipsum with 100 paragraphs generated by https://ru.lipsum.com/feed/html, 61475 bytes long, in the following frame:

```xml
<content description="Text generated by https://ru.lipsum.com/feed/html: 100 paragraphs">
Lorem Ipsum...
</content>
```

"With checks" is the `assert_eq!` of the names returned in the `Start` and `End` events; "without checks" is without them.
The source of the XML (note: the destination slice in `read` must be truncated to `len`, since `copy_from_slice` panics on a length mismatch):

```rust
use std::io;

/// Endlessly replays `chunk` to simulate an arbitrarily large document.
struct Fountain<'a> {
    /// That piece of data repeated infinitely...
    chunk: &'a [u8],
    /// Part of `chunk` that was consumed by the `BufRead` impl
    consumed: usize,
    /// The overall count of read bytes
    overall_read: u64,
}

impl<'a> io::Read for Fountain<'a> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let available = &self.chunk[self.consumed..];
        let len = buf.len().min(available.len());
        let (portion, _) = available.split_at(len);
        // Copy only the matching prefix of `buf`.
        buf[..len].copy_from_slice(portion);
        Ok(len)
    }
}

impl<'a> io::BufRead for Fountain<'a> {
    fn fill_buf(&mut self) -> io::Result<&[u8]> {
        Ok(&self.chunk[self.consumed..])
    }

    fn consume(&mut self, amt: usize) {
        self.consumed += amt;
        // Wrap around to replay the chunk from the beginning.
        if self.consumed == self.chunk.len() {
            self.consumed = 0;
        }
        self.overall_read += amt as u64;
    }
}
```
The test ends when `overall_read >= u32::MAX`.

Because I ran the tests on x86_64, I did not run into any problems, so the only remaining question is how to test this on the correct target to hit the error. It should be possible to ensure that we have such a target on CI.
Thank you @Mingun and @pronebird for your work on this!
I also just released 0.34.0 with this fix.
`quick_xml::reader::Reader<R: BufRead>` allows reading arbitrarily large files without allocating much memory. I am using this in combination with `flate2::read::GzDecoder` in order to decompress and simultaneously parse very large XML files (several gigabytes per document) without having to store the decompressed file (either in memory or on disk).

Unfortunately, because offsets within the document are computed as `usize` internally, this causes `quick-xml` to panic when the document size exceeds 4 GiB on 32-bit targets.

Notably, `buffer_position()` returns `usize`, which is not necessarily sufficient for representing the correct position if the input document size exceeds 4 GiB.

Rust's built-in `std::io` types typically use `u64` for file sizes (and `i64` for relative offsets within files). Hence, I believe it would be appropriate to use `u64` for internal computations (and to return it from `buffer_position()` etc.) as well.