mdsteele / rust-cfb

Rust library for reading/writing Compound File Binary (structured storage) files
MIT License
46 stars 20 forks source link

Reading is very slow for larger files #57

Open jon-zu opened 2 months ago

jon-zu commented 2 months ago

I'm using the msi crate, but after some profiling, I think this crate might be the better place to report that.

Code:

        let setup_file = BufReader::new(File::open("out_is/Setup.msi").unwrap());
        let mut msi = msi::Package::open(setup_file).unwrap();
        let mut stream = msi.read_stream("Data1.cab").unwrap();
        let mut max = stream.take(1024 * 1024 * 50); 
        let mut out_dir = BufWriter::new(File::create("out_is/Data1.cab").unwrap());
        std::io::copy(&mut max, &mut out_dir).unwrap();

Reading 50 Mb of that cab stream takes roughly one minute on my pc. I've also created a flame graph, but I'm not really sure where the problem lies. Is the loop in Chain::new taking a lot of time maybe? flamegraph

mdsteele commented 2 months ago

Just to check, is this timing coming from a --release build?

jon-zu commented 2 months ago

Just to check, is this timing coming from a --release build?

Yes, I can see later today If I can make a small, reproducible example. I'm making a small tool to extract InstallShield installer, where I extract the .msi from and exe.

jon-zu commented 2 months ago

I've built a small example here. 7z can extract this in a few seconds, It seems the StreamReader is slower than 1mb/s.

MSI file(~1gb): https://drive.google.com/file/d/1TqfK1UDhhv1pbfKZe0yWDAHZ-_azVsoT/view?usp=drive_link

Flamegraph for the example(I've disabled printing for that one): flamegraph

It seems the svg is not really working here, I've also added It to the repo.

mdsteele commented 1 month ago

Hmm...it looks like the current implementation re-creates the Chain struct every time it needs to refill the Stream's internal buffer, and then throws the Chain away again. And Chain::new walks the entire FAT chain to the end, even if the caller only wants to read some piece of data from the beginning or middle of the chain. So we end up with this quadratic behavior where, to read all of a very long stream, we have to walk the same chain sectors over and over and over again.

Fixing this might not be straightforward; IIRC part of the reason the Stream doesn't persist the Chain is that sectors could potentially be added/removed while the Stream is still open, and it should be robust against that.

One possibility to experiment with would be that perhaps the Chain shouldn't be keeping a list of sector_ids at all; it should just walk the FAT chain on demand when instructed to seek to a specific position in the chain.