hpc / charliecloud


`ch-convert`: detect tarbombs without reading entire archive #1325

Open · reidpr opened this issue 2 years ago

reidpr commented 2 years ago

Tarballs have no index, so listing all members requires reading the entire archive (and unpacking is a second full read). This issue is to figure out a way to detect a tarbomb without listing all members.
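For concreteness, here is a minimal Python sketch (illustration only, not Charliecloud's actual code; `top_level_names` is a made-up name) of what "listing all members" costs: even in `tarfile`'s streaming mode, every member header must be decompressed and visited, because there is no index to jump to.

```python
import tarfile

def top_level_names(path):
    """Return the set of first path components across all members."""
    tops = set()
    # "r|gz" is pure streaming: one sequential pass, no seeking, but the
    # whole archive is still decompressed just to see every member name.
    with tarfile.open(path, mode="r|gz") as tar:
        for member in tar:
            # Normalize "./foo/bar" and "foo/bar" to the component "foo".
            tops.add(member.name.lstrip("./").split("/", 1)[0])
    return tops
```

Roughly speaking, a single element in this set (a top-level directory) means not a tarbomb; more than one means the members scatter on unpacking.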

History:

  1. We (in 0.26) unpacked the tarball into a subdirectory regardless of whether it was a tarbomb and then rearranged the unpacked files if needed.

  2. This caused hassles with directories called /*/dev being cleared, so we then (in PR #1269, merged for 0.27) listed only the first 1024 members, assuming that if the first path component was the same that far, it was the same for the whole archive. That is, this still did two passes, but the first was highly abbreviated. However, this breaks on Spack images if /spack is first in the archive, because that directory can contain tens of thousands of files (maybe more).

  3. So, as of PR #1320 in 0.27, we use a full pass through the archive to test its tarbomb-ness.

The performance penalty of two full passes is non-trivial: roughly 2× in some quick testing, which is unsurprising. This is a substantial performance regression. Whether or not the archive is in the disk cache does not seem to matter much.

I have not tested decompressing once and then reading the decompressed version twice. I'm pretty sure almost all of the read time is spent in gzip, but this approach adds the time to write the uncompressed tarball as well as the space to store it, so it's not appealing to me.
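For reference, a rough sketch of that untested approach (assumptions: Python's `gzip`/`tarfile`, a POSIX scratch filesystem with enough space; the function name and the unconditional `extractall()` are placeholders, not ch-convert's real behavior), just to show where the extra time and space would go:

```python
import gzip
import shutil
import tarfile
import tempfile

def convert_via_plain_tar(path_gz, dest):
    """Pay the gzip cost once, then run both passes over the plain tar."""
    with tempfile.NamedTemporaryFile(suffix=".tar") as plain:
        # The only decompression pass; all the gzip time is spent here, at
        # the cost of writing (and temporarily storing) the uncompressed tar.
        with gzip.open(path_gz, "rb") as src:
            shutil.copyfileobj(src, plain)
        plain.flush()
        # Pass 1: tarbomb check over the uncompressed copy, no gzip involved.
        # (Reopening the temp file by name is POSIX-only.)
        with tarfile.open(plain.name, mode="r:") as tar:
            tops = {m.name.lstrip("./").split("/", 1)[0] for m in tar}
        tarbomb = len(tops) != 1
        # Pass 2: unpack, again with no re-decompression.
        with tarfile.open(plain.name, mode="r:") as tar:
            tar.extractall(dest)
        return tarbomb
```

Whether this wins depends on how fast the scratch filesystem is relative to re-running gzip; on a slow or shared filesystem the extra write could easily eat the savings, which matches the reluctance above.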

There may be an opportunity to short-circuit if the archive is a tarbomb, i.e., stop reading the archive as soon as a member with a differing first path component is encountered. I do suspect that most archives we encounter will be tarbombs; however, with bad luck, Spack containers may still be slow.
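A minimal sketch of that short-circuit idea (again Python's `tarfile` in streaming mode; `is_tarbomb` is a hypothetical name and the single-top-level-file case is glossed over). Note the asymmetry: proving tarbomb-ness can stop early, but proving non-tarbomb-ness always needs the full pass, and even a tarbomb stops only at the first member whose top-level component differs, which for an archive led by tens of thousands of /spack entries can be deep into the file.

```python
import tarfile

def is_tarbomb(path):
    """True as soon as two distinct top-level components are seen."""
    first = None
    with tarfile.open(path, mode="r|gz") as tar:
        for member in tar:
            top = member.name.lstrip("./").split("/", 1)[0]
            if first is None:
                first = top
            elif top != first:
                # Proven tarbomb: stop decompressing the rest of the archive.
                return True
    # Every member shared one top-level component: not a tarbomb, but we
    # had to read the whole archive to be sure.
    return False
```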

reidpr commented 2 years ago

pigz(1) does not parallelize decompression, because decompression “can’t be parallelized, at least not without specially prepared deflate streams for that purpose”.

libdeflate does provide an optimized gunzip(1) that's at least twice as fast as GNU gunzip(1), but it reads the whole file into memory first, which will be a problem for the large archives that make this bug most annoying.

There is a research project that does claim parallel gunzipping, but it’s quite immature and is based on libdeflate, so I assume it reads the whole file into memory.

reidpr commented 2 years ago

OK, on the Spack archive that led us down this path, the performance penalty is more like 20%.