mhx / dwarfs

A fast high compression read-only file system for Linux, Windows and macOS
GNU General Public License v3.0

Add limit to RAM usage on dwarfsextract #104

Closed · adminx01 closed this issue 2 years ago

adminx01 commented 2 years ago

This would be very useful. I don't want to give up on settings that compress really well, such as -B30, but at extraction time it can use a ton of RAM depending on the type of file.

mhx commented 2 years ago

I guess this is one of those cases where the feature sounds easy to implement, but it actually turns out to be quite hard.

There is, as usual, a trade-off here. If you're using something like -B30, you're entering an agreement at compression time that every small segment that makes up your file could be taken from any of the previous 30 file system blocks, each of which could be several MBs in size.

There are a lot of possible ways to extract data from a DwarFS image. The way it's currently done is that you specify a range that you want to read from a file, e.g. you want to read 10,000 bytes from offset 250,000. The DwarFS library then determines which blocks it needs to decompress, decompresses them, and either passes you a list of chunks that you can iterate yourself or copies the chunks into a piece of memory. In the first case (list of chunks), all referenced filesystem blocks have to be held in memory; in the second case, each block can be dismissed as soon as its chunks have been copied. The filesystem cache keeps the most recently accessed blocks around so they don't have to be decompressed over and over again.

Now, dwarfsextract uses a pretty simple strategy: iterate over all files and, for each file, read it in full and pass it to libarchive. This means that if you have huge files that potentially back-reference quite far into the list of blocks, dwarfsextract will consume a lot of memory.
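Roughly, the current loop looks like this in pseudo-C++ (the `filesystem` interface below is made up purely for illustration; it's not the actual library API):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Made-up stand-in for the DwarFS library interface, just to show the shape
// of the strategy; these are not the real class or method names.
struct filesystem {
  std::vector<std::string> files() const { return {}; }
  size_t file_size(std::string const&) const { return 0; }
  // Reading a byte range forces every filesystem block referenced by the
  // chunks in that range to be decompressed.
  std::vector<uint8_t> read(std::string const&, size_t /*offset*/, size_t size) const {
    return std::vector<uint8_t>(size);
  }
};

// Stand-in for handing the file's data to libarchive.
void send_to_libarchive(std::string const&, std::vector<uint8_t> const&) {}

// Current strategy: read each file in full, then pass it on. A huge file
// built with -B30 can back-reference dozens of earlier blocks, and all of
// them end up decompressed at the same time.
void extract_all(filesystem const& fs) {
  for (auto const& path : fs.files()) {
    send_to_libarchive(path, fs.read(path, 0, fs.file_size(path)));
  }
}
```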

A very memory-efficient strategy I can think of would be to decompress each filesystem block in turn and slowly piece together the on-disk output files from their individual segments. There are a few drawbacks, though: first, this would be quite a bit of work to implement; second, to be efficient, it would require a reverse-lookup table that could itself consume quite a bit of memory (though that table could be rebuilt per block); last but not least, this would likely not play well (or at all) with libarchive. So while it'd definitely be an interesting exercise, it'd be a lot of work and would likely make the implementation a lot more complex than it currently is.
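Just to sketch the idea with a toy model (again, these are not the real DwarFS data structures or APIs): build a reverse-lookup table from each block to the output extents that reference it, pre-size the output files, then decompress each block exactly once and scatter its segments to their final positions:

```cpp
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <map>
#include <string>
#include <vector>

// Toy model of the metadata: each file is a list of chunks, and each chunk
// says "copy `size` bytes starting at `block_offset` of filesystem block `block`".
struct chunk {
  size_t block;
  size_t block_offset;
  size_t size;
};

// One output segment: where a chunk of a block ends up in an output file.
struct extent {
  std::string file;
  size_t file_offset;
  size_t block_offset;
  size_t size;
};

// Stand-in for "decompress filesystem block i"; the real thing would run the
// compressed block through the configured decompressor.
std::vector<char> decompress_block(size_t /*i*/) {
  return std::vector<char>(16 << 20);  // pretend blocks are 16 MiB
}

void extract_block_centric(std::map<std::string, std::vector<chunk>> const& files) {
  // Reverse-lookup table: block index -> all output extents that read from it.
  std::map<size_t, std::vector<extent>> by_block;
  for (auto const& [path, chunks] : files) {
    size_t file_offset = 0;
    for (auto const& c : chunks) {
      by_block[c.block].push_back({path, file_offset, c.block_offset, c.size});
      file_offset += c.size;
    }
    // Pre-create the output file at its final size so its segments can be
    // written in any order later.
    std::ofstream(path, std::ios::binary).close();
    std::filesystem::resize_file(path, file_offset);
  }

  // Decompress each block exactly once and scatter its segments into the
  // output files; only one decompressed block is alive at any time.
  for (auto const& [block, extents] : by_block) {
    auto data = decompress_block(block);
    for (auto const& e : extents) {
      std::fstream out(e.file, std::ios::in | std::ios::out | std::ios::binary);
      out.seekp(static_cast<std::streamoff>(e.file_offset));
      out.write(data.data() + e.block_offset, static_cast<std::streamsize>(e.size));
    }
  }
}
```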

One thing that should be quite trivial to implement and might get you pretty close to what you want would be to read individual files in blocks of a configurable size rather than reading them in full. This would definitely help if you're not only using -B30, but also storing huge files in the DwarFS filesystem.
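Something along these lines (same made-up interface as in the first sketch; `window_size` would be a new, configurable knob):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Same made-up interface as in the sketch further up; not the real API.
struct filesystem {
  std::vector<std::string> files() const { return {}; }
  size_t file_size(std::string const&) const { return 0; }
  std::vector<uint8_t> read(std::string const&, size_t /*offset*/, size_t size) const {
    return std::vector<uint8_t>(size);
  }
};

void send_to_libarchive(std::string const&, std::vector<uint8_t> const&) {}

// Windowed strategy: read each file in pieces of `window_size` bytes. At any
// point, only one window plus whatever the block cache holds needs to be in
// memory, instead of every block referenced by the whole file.
void extract_all(filesystem const& fs, size_t window_size) {
  for (auto const& path : fs.files()) {
    size_t const total = fs.file_size(path);
    for (size_t offset = 0; offset < total; offset += window_size) {
      size_t const count = std::min(window_size, total - offset);
      send_to_libarchive(path, fs.read(path, offset, count));
    }
  }
}
```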

Sorry for all the rambling, I've pretty much just written down my thoughts while pondering the problem. Can you confirm that you're running into this in cases where you're storing large files in the DwarFS filesystem?

adminx01 commented 2 years ago

Yes, my use case includes large files regularly.

For now, I've written an automated method for my use case that mounts the image and copies all the content. This doesn't use a lot of RAM and works for me. However, it's messier on my side than simply letting dwarfsextract take care of it.

mhx commented 2 years ago

Yeah, the reason mounting/copying works is that it doesn't read the large files all at once. I'll try to make that change to dwarfsextract for the next release.

adminx01 commented 2 years ago

Turns out I couldn't make my copying method very trustworthy; it was quite a headache to write with all the checks it needed. I'm leaving it to you then! (hoping it's soon)

mhx commented 2 years ago

Hi @miocrime, sorry for the long delay. I think I've got a fix for this, but I need to do some systematic testing to make sure it actually works correctly and there's also a data race I need to investigate.

mhx commented 2 years ago

@miocrime, could you please build/test the latest code from the main branch? 186eb76 should be the fix for this issue.

mhx commented 2 years ago

Here's a quick test extracting a DwarFS image with two files, each ~18 GiB in size.

[screenshot of the test run]

daci12345 commented 2 years ago

I built it, and it can now extract filesystems that got OOM'd for me before, with at most 1.5-3 GB of RAM usage.

mhx commented 2 years ago

Cool! BTW, the amount of memory being used can be (roughly) adjusted using the --cache-size option.

adminx01 commented 2 years ago

I can confirm that RAM usage went down in these special scenarios, from 10-15 GiB to 1.8 GiB max.

I guess the issue can be closed then.