nodejs / build

Better build and test infra for Node.

Use different compression format than .xz? (Considered unreliable) #748

Closed fhemberger closed 6 years ago

fhemberger commented 7 years ago

I just stumbled over this article; maybe we should consider changing the compression format for builds: http://www.nongnu.org/lzip/xz_inadequate.html

Quote:

Xz is fragmented by design. Xz implementations may choose what subset of the format they support. They may even choose to not support integrity checking at all. Safe interoperability among xz implementations is not guaranteed, which makes the use of xz inadvisable not only for long-term archiving, but also for data sharing and for free software distribution. Xz is also unreasonably extensible; it has room for trillions of compression algorithms, but currently only supports one, LZMA2, which in spite of its name is not an improved version of LZMA, but an unsafe container for LZMA data. Such egregious level of extensibility makes corruption both more probable and more difficult to recover from. Additionally, the xz format lacks a version number field, which makes xz's extensibility problematic.

Xz fails to protect critical fields like length fields and flags signalling the presence of optional fields. Xz uses variable-length integers unsafely, specially when they are used to store the size of other fields or when they are concatenated together. These defects make xz fragile, meaning that most of the times when it reports a false positive, the decoder state is so mangled that it is unable to recover the decompressed data.

Error detection in the xz format is less accurate than in bzip2, gzip and lzip formats mainly because of false positives, and specially if an overkill check sequence like SHA-256 is used in xz. Another cause of false positives is that xz tries to detect errors in parts of the compressed file that do not affect decompression, like the padding added to keep the useless 4 byte alignment. In total xz reports several times more false positives than bzip2, gzip or lzip, and every false positive may result in unnecessary loss of data.

All these defects and design errors reduce the value of xz as a general-purpose format because anybody wanting to archive a file already compressed in xz format will have to either leave it as-is and face a larger risk of losing the data, or waste time recompressing the data into a format more suitable for long-term archiving.

jbergstroem commented 7 years ago

Meh. I've seen that before as well. My take is that we're going to have a hard time finding a universally installed compression format that is as efficient. If it works for the Linux kernel and most Linux distributions, it works for me.

If data validation is a bigger concern to you, using git and/or .tar.gz is likely a better choice.
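
For illustration, here is a minimal Python sketch of validating a downloaded tarball independently of the container format's own integrity check. It assumes a SHA-256 digest for the file is published out of band (Node.js releases ship a `SHASUMS256.txt`). The function names `matches_published_digest` and `decompresses_cleanly` are hypothetical helpers, not part of any existing tooling.

```python
import gzip
import hashlib
import lzma
import zlib


def matches_published_digest(data: bytes, expected_sha256: str) -> bool:
    """Compare the raw archive bytes against a digest published out of band."""
    return hashlib.sha256(data).hexdigest() == expected_sha256.strip().lower()


def decompresses_cleanly(data: bytes, fmt: str) -> bool:
    """Fully decompress in memory and report whether the decoder raised an error.

    fmt is "xz" or "gz" (hypothetical labels for this sketch).
    """
    try:
        if fmt == "xz":
            lzma.decompress(data)
        elif fmt == "gz":
            gzip.decompress(data)
        else:
            raise ValueError(f"unsupported format: {fmt}")
    except (lzma.LZMAError, OSError, EOFError, zlib.error):
        return False
    return True
```

Checking the digest first means a corrupted download is rejected regardless of how the archive format handles errors, so the choice between `.tar.xz` and `.tar.gz` matters less for distribution than for long-term archiving.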

maclover7 commented 6 years ago

Closing for now: this would probably be nice to have, but it does not currently seem to be within our means. If someone wants to tackle this, please feel free to reopen.