google / brotli

Brotli compression format
MIT License
13.35k stars 1.23k forks source link

Strange compression ratio on large CSV file... #1071

Open sebres opened 10 months ago

sebres commented 10 months ago

It is rather a question as an issue basically, but since the repository doesn't have discussions (why actually not?) I post it here...

Testing different compression levels on a large CSV files (with v. 1.1.0 and 1.0.9), I noticed a bit strange behavior with ratio on certain levels. Here is an example on public file (downloadable from here, diff used for highlighting purposes only):

  $ brotli -3fcv "1500000 Sales Records.csv" > /dev/null
  Compressed [1500000 Sales Records.csv]: 178.511 MiB -> 58.373 MiB in 3.05 sec

  $ brotli -4fcv "1500000 Sales Records.csv" > /dev/null
+ Compressed [1500000 Sales Records.csv]: 178.511 MiB -> 45.869 MiB in 5.55 sec

  $ brotli -5fcv "1500000 Sales Records.csv" > /dev/null
- Compressed [1500000 Sales Records.csv]: 178.511 MiB -> 49.595 MiB in 10.15 sec

  $ brotli -6fcv "1500000 Sales Records.csv" > /dev/null
- Compressed [1500000 Sales Records.csv]: 178.511 MiB -> 46.090 MiB in 15.79 sec

  $ brotli -7fcv "1500000 Sales Records.csv" > /dev/null
  Compressed [1500000 Sales Records.csv]: 178.511 MiB -> 41.962 MiB in 27.22 sec

The times looks OK, but the ratio by compression with level 4 marches to a different drummer, in fact it does that drastically and it is better than level 5 but even than level 6 too.

The question is what could cause this? And probably how we could avoid that basically slower levels 5/6 start to generate larger output or somehow heuristic force the algorithm "switch" of levels by similar circumstances?

eustas commented 10 months ago

Hmm, that is interesting. Indeed, different quality settings use different approaches in searching "back-refs". Thanks for the heads-up, I'll investigate this case when I have spare cycles.

yota-code commented 1 week ago

I can contribute another case, where -q 2 compression is better than -q3, with the tarball of the sources of mathjax (as found here: https://github.com/mathjax/MathJax/archive/refs/tags/3.2.2.tar.gz, only ungzipped)

mathjax_v322

This test was part of a comparison between zstd and brotli, and brotli steals the pareto front on almost all cases :+1:

sebres commented 6 days ago

I can confirm it...

However the picture is really untypical (something is specific in this source code). I still didn't analyze it, but it looks weird... Because typical picture, as regards the compression of source code generally, looks similar this (left side), for comparison also MathJax-3.2.2.tar (right side):

git-2.45.2.tar MathJax-3.2.2.tar
image image

* brotli, levels 1..11; zstd and zstd-mt (multi-threaded), levels 1..22

I'm not about the winner or pareto efficiency (because it depends), but about "linearity" of the chart lines: even with logarithmic y-axis the right chart looks still "exponential" a bit, or else seems to have a worse (polynomial) complexity (e. g. O(nx) or even direction O(2n)) for levels larger than 4 (brotli) or 9 (zstd); or vice versa - the best (linear) complexity for levels smaller or equal than those (with certain anomality at level 2 or 3 by brotli). Likewise, the "jump" between 8 and 9 is also striking by zstd (for MathJax sources), but hardly noticeable (for git sources).

This looks rather like an outlier to me, however surely worth to analyze.

sebres commented 6 days ago
To illustrate the mentioned striking feature here the same charts, but using linear scales and without extremely fast and extremely slow levels (to exclude part affected by discrepancy due to measure or init overheads for short values as well as the part with large times by ultra compression): git sources MathJax sources
image image

* brotli, levels 2..9; zstd and zstd-mt (multi-threaded), levels 4..15

The "anomality" is pretty obvious here.

And if I take a closer look at the tarball, I think I know why - it is full with compressed/packed JS-code, for instance mhchemParser, so it is not the source code in direct sense, but rather mix of text and packed binary. It looks like both (brotli and zstd) get "confused" by this kind of data.