kspalaiologos / bzip3

A better and stronger spiritual successor to BZip2.
GNU Lesser General Public License v3.0
660 stars, 37 forks

data integrity failure on truncated stream #106

Open kilobyte opened 1 year ago

kilobyte commented 1 year ago

Unlike other Unix compressors, bzip3 fails to notice data truncation if the compressed stream ends at a block boundary. A stream truncated this way is indistinguishable from a complete one, leading to silent data loss.

Furthermore, while compressed block boundaries are "random" by length, the timing pattern makes it very likely that such truncations happen naturally, with no malice involved. The library writes a series of blocks, takes a long while processing a new series, and only then resumes output. Thus, any mishap (crash, power loss, a network failure, a pendrive being ejected, a backup snapshot, OOM, a timeout, etc.) will very likely leave a file that appears to be correctly terminated. This is compounded by the tool forcing a flush at a block boundary. Such a flush is normally beneficial due to cache locality, but without it the tail of a block stuck in stdio buffers would at least cut the file mid-block and make the error noisy.

Alas, while it'd be easy to add such a marker (a block header with length=0 or a magic value >511MB), any such change would break bytestream compatibility with the current version of the library.
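
To make the failure mode concrete, here is a minimal sketch using a toy length-prefixed framing. It is not bzip3's actual on-disk format; the 4-byte little-endian header, `END_MARKER`, `write_stream` and `read_stream` are all hypothetical names chosen for illustration. It only shows why a cut at a block boundary goes unnoticed without a terminator, and how a zero-length end marker would catch it.

```python
import io
import struct

# Hypothetical convention: a block header with length 0 marks end of stream.
END_MARKER = 0


def write_stream(blocks, with_end_marker):
    """Serialize blocks as [4-byte little-endian length][payload]..."""
    out = io.BytesIO()
    for block in blocks:
        out.write(struct.pack("<I", len(block)))
        out.write(block)
    if with_end_marker:
        out.write(struct.pack("<I", END_MARKER))
    return out.getvalue()


def read_stream(data, expect_end_marker):
    """Parse the toy stream, raising ValueError on any detectable truncation."""
    src = io.BytesIO(data)
    blocks = []
    while True:
        header = src.read(4)
        if len(header) == 0:
            # Clean EOF at a block boundary: only an error if a marker is required.
            if expect_end_marker:
                raise ValueError("truncated stream: missing end marker")
            return blocks
        if len(header) < 4:
            raise ValueError("truncated stream: partial block header")
        (length,) = struct.unpack("<I", header)
        if expect_end_marker and length == END_MARKER:
            return blocks
        payload = src.read(length)
        if len(payload) < length:
            raise ValueError("truncated stream: partial block")
        blocks.append(payload)
```

In this sketch, cutting the unterminated stream exactly after the second block makes `read_stream(..., expect_end_marker=False)` return two blocks without complaint, while the same cut on a terminated stream raises `ValueError`. A cut in the middle of a block is detected either way, which is the "noisy" case described above.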

kspalaiologos commented 1 year ago

One way to prevent this situation from happening would be immediately testing the file using bzip3 -t to determine if the compressed size matches the decompressed size.

kilobyte commented 1 year ago

There's no record of the decompressed size anywhere; in fact, there's not even a way to know it beforehand if the input comes from a pipe or /proc.