nemequ / squash-corpus

Designing a new corpus for lossless general-purpose compression

Highly compressible data with long run lengths #12

Open ivan-tkatchev opened 9 years ago

ivan-tkatchev commented 9 years ago

Examples include VT100 escape codes from a terminal or images in PPM format. Such data is highly compressible, with compression ratios of 20x or more. In some codecs this kind of data causes strange behavior and triggers edge cases.

Here is an example file: https://www.dropbox.com/s/5gzk7ro4ze7v3xw/testimage.ppm?dl=0

(Just a screencap from a terminal emulator on my machine, nothing that could have licensing issues.)

ivan-tkatchev commented 9 years ago

P.S. This file triggers strange behavior in gzip: the gzip output is 3.5 times larger than the bzip2 output for the same file. This has practical implications: converted to PNG, the image is almost twice as big as when converted to GIF, which really should never happen. (PNG uses deflate under the hood.)
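
For anyone who wants to poke at this kind of input without downloading the file, here is a rough sketch in Python that builds a synthetic PPM with long runs and compares deflate (via `zlib`) against bzip2. The image contents, the `make_ppm` and `ratios` helper names, and the band colours are all made up for illustration; this is not the test image above, and the numbers it prints will not match the 3.5x gap reported for that file.

```python
import bz2
import zlib


def make_ppm(width=640, height=480, band=40):
    """Build a synthetic binary PPM (P6) of solid-colour horizontal bands.

    Hypothetical stand-in for a terminal screencap: every row is one
    colour repeated, so the pixel data consists of very long runs.
    """
    colours = [(0, 0, 0), (255, 255, 255), (0, 170, 0), (170, 0, 0)]
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    rows = bytearray()
    for y in range(height):
        # Pick the band colour for this row and repeat it across the row.
        rows += bytes(colours[(y // band) % len(colours)]) * width
    return header + bytes(rows)


def ratios(data):
    """Return the compression ratio (input size / output size) per codec."""
    deflated = zlib.compress(data, 9)   # deflate, as used by gzip and PNG
    bzipped = bz2.compress(data, 9)     # bzip2 with the largest block size
    return {
        "deflate": len(data) / len(deflated),
        "bzip2": len(data) / len(bzipped),
    }


if __name__ == "__main__":
    image = make_ppm()
    for codec, ratio in ratios(image).items():
        print(f"{codec}: {ratio:.1f}x")
```

Running it prints one ratio per codec. Swapping the synthetic image for the bytes of a real screencap should show how far the two codecs diverge on actual data.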