vasi / pixz

Parallel, indexed xz compressor
BSD 2-Clause "Simplified" License
698 stars 61 forks source link

pixz | xz -d corrupts data #65

Open JoernEngel opened 8 years ago

JoernEngel commented 8 years ago

I checked whether pixz and xz format were compatible. Looks like they are. But on a large testfile, I don't get the same data back as before. In particular, the output file is larger than the input file.

All bytes up to the end of input file are identical. Output file simply has additional data. Additional data seems derived from input file, not all zeroes or such.

$ ls -l foo -rw-r--r-- 1 joern joern 4380151296 Apr 26 11:30 foo $ cat foo | pixz -0 | xz -d > bar $ ls -l bar -rw-r--r-- 1 joern joern 4384062468 Apr 26 12:08 bar

pixz 1.0.2 xz (XZ Utils) 5.1.0alpha

Both from Debian, running on x86_64.

JoernEngel commented 8 years ago

While dealing with data corruption, can I suggest running American Fuzzy Lop against every combination of pixz and xz? People found load of bugs in compressors with relatively little effort.

Can I also suggest a major version change when the data corruption bug gets fixed? At this point we have decided not to use pixz at all. Not ever. The danger of someone installing a broken version somewhere and having to spend weeks to track down the problem trumps any speed gains.

wookietreiber commented 8 years ago

@JoernEngel Please try to reproduce with version 1.0.6.

wookietreiber commented 8 years ago

Also, @JoernEngel, is it possible for you to provide the file you are testing with as a download so we can try to work with it? For such a download, please also provide us with the md5 of its original, uncompressed version, and then compress it via another compression tool like gzip to exclude any xz-related issues.

JoernEngel commented 8 years ago

On Tue, Apr 26, 2016 at 12:41:20PM -0700, Christian Krause wrote:

@JoernEngel Please try to reproduce with version 1.0.6.

I tried and failed.

./configure make make install

There is no ./configure checked in. Autoconf is throwing errors: configure.ac:11: error: possibly undefined macro: AM_INIT_AUTOMAKE If this token and others are legitimate, please use m4_pattern_allow. See the Autoconf documentation.

The file I picked contains confidential data. But seriously, I litterally picked the first large file I had lying around and did the most obvious smoke test I could think of. If it was this easy for me to break thing, you should be able to reproduce this with a random dvd image or similar.

Jörn

A surrounded army must be given a way out. -- Sun Tzu

wookietreiber commented 8 years ago

configure is in here. You probably downloaded one of the GitHub-generated tarballs. These do not contain the make dist generated files.

JoernEngel commented 8 years ago

On Tue, Apr 26, 2016 at 01:02:40PM -0700, Christian Krause wrote:

configure is in here. You probably downloaded one of the GitHub-generated tarballs. These do not contain the make dist generated files.

I used the git tree. Anyway, 1.0.6 reproduced the problem, file sizes are identical to 1.0.2 sizes.

And I found yet another fun bug. $ zcat foo.gz | ./pixz -0 | ./pixz -d | wc -c can not seek in input: Illegal seek

Jörn

It's not that I'm so smart, it's just that I stay with problems longer. -- Albert Einstein

JoernEngel commented 8 years ago

pixz -0 | pixz -d seems to work correctly, modulo the "Illegal Seek" message. I suppose that explains how this bug slipped through.

Trouble is that files generated by pixz get detected as regular xz files and using regular xz to uncompress yields an incorrect result. Not sure whether pixz is creating an incorrect archive or xz is violating the spec. Not even sure if there is a spec.

But xz came first and pixz, for better or worse, has to be compatible. Bug-for-bug compatible, if necessary. Sorry, but you have some work to do.

Jörn

Measure. Don't tune for speed until you've measured, and even then don't unless one part of the code overwhelms the rest. -- Rob Pike

wookietreiber commented 8 years ago

I just tried with a large file and couldn't reproduce:

$ dd if=file bs=1M | md5sum
4478+1 records in
4478+1 records out
4696129652 bytes (4.7 GB, 4.4 GiB) copied, 10.9025 s, 431 MB/s
44c5d7d956b71ba1724875c7b0648bf3  -

$ dd if=file bs=1M | pixz -0 | xz -d | md5sum
4478+1 records in
4478+1 records out
4696129652 bytes (4.7 GB, 4.4 GiB) copied, 619.726 s, 7.6 MB/s
44c5d7d956b71ba1724875c7b0648bf3  -

pixz 1.0.6 xz (XZ Utils) 5.2.2 liblzma 5.2.2

wookietreiber commented 8 years ago

@vasi any insights?

vasi commented 8 years ago

@JoernEngel : Could it be that the thing you compressed is a tar file? When it sees a tarball, pixz will by default add an index, so that it's possible to seek to a specific file.

This is done in a safe way, nothing is "corrupted":

If you've discovered a type of file that is not a tarball, but can fool libarchive's checks, please let us know.

Also, please try to be kind. pixz is a tool that other people wrote for free, and just gave to you (and everyone else), and are now supporting for free as well. Nobody is forcing you to use it. If you want to be helpful, that's great! But if you want to act aggressively, you're welcome to find someone offering paid support instead of using our issue tracker.

JoernEngel commented 8 years ago

On Tue, Apr 26, 2016 at 02:00:45PM -0700, Dave Vasilevsky wrote:

@JoernEngel : Could it be that the thing you compressed is a tar file? When it sees a tarball, pixz will by default add an index, so that it's possible to seek to a specific file.

This is done in a safe way, nothing is "corrupted":

  • pixz checks that the file looks like a tarball before doing this, non-tar files won't get an index
  • pixz uses libarchive to check if something is a tarball, which is the same library that /usr/bin/tar uses on OS X and FreeBSD—it's pretty robust.
  • pixz will remove this index when you decompress
  • pixz adds the index after the tar end-of-file marker—so even if you decompress with xz, tar will just ignore this extra data
  • You can turn off the indexing functionality with the -t flag.

If you've discovered a type of file that is not a tarball, but can fool libarchive's checks, please let us know.

Indeed, the file is a tarball and the extra data looks like filenames with some binary (offsets?) in between.

I would still call this corruption. If the archive looks like regular xz, gets processed by regular xz and the result is different from the original input file, that is corruption.

Tarball mode clearly has its appeal. I can see why you did it. But I think it shouldn't be the default. If one variant results in worse performance and the other variant sometimes results in a catastrophy, the default should be slow and safe.

Historically there have been consumer flash devices that worked in similar ways. They assumed a FAT filesystem, would parse the FAT and erase blocks belonging to deleted files. Works great if you actually use FAT. And given the many implementations with subtle incompatibilities, you should use the correct variant.

The problem with autodetection is that things can go wrong and eventually will go wrong.

Anyway, another interesting question is whether the pixz archive can be "poisoned" in such a way that xz ignores the index, but correctly decompresses everything else. But then you would have to change an established format and depend on future versions of xz to handle "poisoned" archives the same way. Might be too far into crazy land.

Jörn

Functionality is an asset, but code is a liability. --Ted Dziuba

JoernEngel commented 8 years ago

On Tue, Apr 26, 2016 at 02:00:45PM -0700, Dave Vasilevsky wrote:

Also, please try to be kind. pixz is a tool that other people wrote for free, and just gave to you (and everyone else), and are now supporting for free as well. Nobody is forcing you to use it. If you want to be helpful, that's great! But if you want to act aggressively, you're welcome to find someone offering paid support instead of using our issue tracker.

Would you consider paid support?

I believe pixz is exactly what we need and would save us money. Sending some of the saved money your way would only be fair.

Jörn

The cost of changing business rules is much more expensive for software than for a secretary. -- unknown

vasi commented 8 years ago

Yeah, unfortunately the xz file format doesn't allow for arbitrary extra data to be carried along, like gzip does. Or at least I haven't found a way to do that.

While nowadays most people probably use pixz for its multi-core support, it was actually originally developed for the indexing feature! Once that was implemented, it turned out that parallelizing things was trivial, so that happened. But indexing is a pretty core part of pixz's mission. Sorry it doesn't suit you. You might want to create an alias pixz='pixz -t' or something, so you don't have to deal with it.

I do think it would be nice if pixz eventually supported a configuration file, so folks could say whether or not they wanted this feature. (Maybe also for features like number of CPUs! I'm sure some people don't appreciate us maxing out all their cores.)

wookietreiber commented 8 years ago

@JoernEngel Would you please verify that the pixz -t -0 | xz -d round trip works as expected for your input file?

JoernEngel commented 8 years ago

On Tue, Apr 26, 2016 at 02:55:46PM -0700, Dave Vasilevsky wrote:

I do think it would be nice if pixz eventually supported a configuration file, so folks could say whether or not they wanted this feature. (Maybe also for features like number of CPUs! I'm sure some people don't appreciate us maxing out all their cores.)

Interesting. That might also explain whey "pixz | xz -d" is slower than "pixz". If xz keeps roughly one cpu busy, you have two effects costing you performance. The first is the OS scheduler moving threads around. The second is that one thread is always behind. If you have points in your code that depend on all threads catching up and making roughly equal progress, the faster threads go idle during those times.

Looks like "pixz | xz -d" keeps about 3.5 cpus busy on my notebook, while "pixz" keeps about 7.5 cpus busy.

Anyway, thank you for this interesting piece of software. Extra thanks for doing it in only 2250 lines.

Jörn

When in doubt, use brute force. -- Ken Thompson

JoernEngel commented 8 years ago

On Tue, Apr 26, 2016 at 03:02:30PM -0700, Christian Krause wrote:

@JoernEngel Would you please verify that the pixz -t -0 | xz -d round trip works as expected?

Verified. "pixz -t -0 | xz -d" works correctly.

Jörn

The trouble with the world is that the stupid are cocksure and the intelligent are full of doubt. -- Bertrand Russell