**Open** · sebres opened this issue 1 year ago
Hmm, that is interesting. Indeed, different quality settings use different approaches to searching for back-references ("back-refs"). Thanks for the heads-up, I'll investigate this case when I have spare cycles.
I can contribute another case where `-q 2` compression is better than `-q 3`: the tarball of the MathJax sources (as found here: https://github.com/mathjax/MathJax/archive/refs/tags/3.2.2.tar.gz, only gunzipped).
This test was part of a comparison between zstd and brotli, and brotli dominates the Pareto front in almost all cases :+1:
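To make such level-vs-size-vs-time comparisons reproducible, here is a minimal sketch of a level sweep. It uses Python's stdlib `zlib` purely as a stand-in (brotli and zstd themselves would need their CLI tools or third-party bindings), and the input is synthetic, not the tarballs discussed in this thread:

```python
import time
import zlib

def sweep_levels(data: bytes):
    """Collect (level, compressed_size, seconds) for each zlib level."""
    points = []
    for level in range(1, 10):  # zlib supports levels 1..9
        start = time.perf_counter()
        out = zlib.compress(data, level)
        points.append((level, len(out), time.perf_counter() - start))
    return points

# Synthetic, source-code-like input (NOT the MathJax/git tarballs).
sample = b"".join(b"function f%d() { return %d; }\n" % (i, i * 7)
                  for i in range(5000))
for level, size, secs in sweep_levels(sample):
    print(f"level {level}: {size} bytes, {secs:.4f}s")
```

Plotting these (size, time) points per level is exactly how the charts below were obtained, just with the real compressors instead of `zlib`.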
I can confirm it...
However, the picture is really atypical (something is specific about this source code). I haven't analyzed it yet, but it looks weird... The typical picture for compressing source code generally looks like this (left side); for comparison, MathJax-3.2.2.tar is on the right side:
[charts: compressed size vs. time, log y-axis — git-2.45.2.tar (left), MathJax-3.2.2.tar (right)]
* brotli, levels 1..11; zstd and zstd-mt (multi-threaded), levels 1..22
I'm not talking about the winner or Pareto efficiency (because that depends), but about the "linearity" of the chart lines: even with a logarithmic y-axis the right chart still looks a bit "exponential", or at least seems to show worse (polynomial) complexity (e.g. O(n^x), or even tending toward O(2^n)) for levels larger than 4 (brotli) or 9 (zstd); and, vice versa, the best (linear) complexity for levels smaller than or equal to those (with a certain anomaly at level 2 or 3 for brotli). Likewise, the "jump" between levels 8 and 9 is striking for zstd on the MathJax sources, but hardly noticeable on the git sources.
This looks rather like an outlier to me, but it is surely worth analyzing.
To illustrate the striking feature mentioned above, here are the same charts with linear scales and without the extremely fast and extremely slow levels (to exclude the range affected by measurement/init overhead at very short times, as well as the ultra-compression range with very large times):

[charts: compressed size vs. time, linear scales — git sources (left), MathJax sources (right)]
* brotli, levels 2..9; zstd and zstd-mt (multi-threaded), levels 4..15
The "anomality" is pretty obvious here.
And taking a closer look at the tarball, I think I know why: it is full of compressed/packed JS code, for instance mhchemParser, so it is not source code in the direct sense, but rather a mix of text and packed binary data. It looks like both brotli and zstd get "confused" by this kind of data.
This is basically a question rather than an issue, but since the repository doesn't have Discussions enabled (why not, actually?), I'm posting it here...
Testing different compression levels on large CSV files (with v1.1.0 and v1.0.9), I noticed somewhat strange behavior of the ratio at certain levels. Here is an example on a public file (downloadable from here; diff is used for highlighting purposes only):
The times look OK, but the compression ratio at level 4 marches to the beat of a different drummer, and drastically so: it is better not only than level 5 but even than level 6.
The question is: what could cause this? And how could we avoid the situation where the basically slower levels 5/6 produce larger output, or perhaps add a heuristic that makes the algorithm "switch" levels under such circumstances?
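Until the underlying heuristic is understood, one pragmatic application-side workaround (not a fix in the library) is to try the requested level plus its neighbors and keep the smallest output. A hedged sketch, with `best_of_neighbors` being a hypothetical helper name and stdlib `zlib` again standing in for the real compressor:

```python
import zlib

def best_of_neighbors(data: bytes, level: int, spread: int = 1) -> bytes:
    """Workaround sketch: compress with the requested level and its
    neighbors, keep whichever output is smallest (costs extra CPU time)."""
    lo, hi = max(1, level - spread), min(9, level + spread)
    return min((zlib.compress(data, lv) for lv in range(lo, hi + 1)), key=len)

# Synthetic CSV-like input (not the public file from the report).
data = b"id,name,value\n" + b"".join(b"%d,row%d,%d\n" % (i, i, i * 3)
                                     for i in range(20000))
out = best_of_neighbors(data, 5)
print(len(out), "<=", len(zlib.compress(data, 5)))
```

By construction the result is never larger than the single-level output, since the requested level is among the candidates; the price is compressing the input up to `2 * spread + 1` times.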