Thank you for this interesting suggestion.
We already computed the sum of all the fractional bits lost to Huffman
coding, and it was 0.8% of the total compressed corpus size. We believe this
is the maximum that we can gain by simply switching from Huffman to
arithmetic coding.
We will study this possibility and report back with our findings.
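The fractional-bit loss mentioned above can be illustrated with a small sketch: build a Huffman code, then compare its expected code length against the Shannon entropy. The distribution below is made up for illustration, not Brotli's measured statistics.

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Return the Huffman code length of each symbol for a probability list."""
    # Heap entries: (probability, tie-breaker, [(symbol, depth), ...]).
    heap = [(p, i, [(i, 0)]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    tick = len(probs)
    while len(heap) > 1:
        pa, _, a = heapq.heappop(heap)
        pb, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        heapq.heappush(heap, (pa + pb, tick, [(s, d + 1) for s, d in a + b]))
        tick += 1
    return dict(heap[0][2])

# Hypothetical skewed distribution, not Brotli's actual one.
probs = [0.55, 0.25, 0.10, 0.05, 0.05]
lengths = huffman_code_lengths(probs)
avg = sum(p * lengths[i] for i, p in enumerate(probs))   # Huffman bits/symbol
entropy = -sum(p * math.log2(p) for p in probs)          # Shannon limit
print(f"Huffman {avg:.3f} vs entropy {entropy:.3f} bits/symbol "
      f"({100 * (avg - entropy) / avg:.2f}% fractional-bit loss)")
```

Summing this per-symbol loss over a whole corpus, weighted by symbol counts, gives the kind of 0.8% upper-bound figure described above.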
Original comment by szaba...@google.com
on 6 Jan 2014 at 4:23
We implemented an ANS based entropy coder and used it to replace the Huffman
coder in Brotli.
The compression ratio and decoding speed results are summarized in the attached
study.
Based on that, we concluded that for the specific use-case in Brotli, the
Huffman coder performs better than an ANS based entropy coder.
Original comment by szaba...@google.com
on 29 Jan 2014 at 4:26
Sorry for opening the issue again.
Could you test with relational/tabular data too (like the data your data
processing teams at Google are dealing with)?
In our tests, the probability distributions of relational/tabular data are
often a *lot* better matched to arithmetic encoders than to Huffman-based
entropy encoders.
Original comment by est...@gmail.com
on 29 Jan 2014 at 8:38
And I should add that the probability distributions don't change much across
whole relational data streams, so the compression blocks can remain large
(fewer headers).
(Both this and the previous comment assume that you work with columnar blocks,
like RCFile does.)
Original comment by est...@gmail.com
on 29 Jan 2014 at 8:55
I see the implementation from the document is slower than Huffman coding - this
can probably be fixed, as both implementations I am aware of are faster than or
comparable to Huffman: the linked one by Yann Collet, and one by Pascal
Massimino (skal) from Google, who presents his results here:
https://groups.google.com/a/webmproject.org/forum/#!topic/codec-devel/idezdUoV1yY
Regarding the sizes of headers containing probability distributions: in Yann's
implementation these probabilities are found adaptively from already processed
data - there is no cost of storing them.
If you indeed need to store them, there is a question of how well we want to
approximate/quantize them - in ANS we could also use approximation by powers of
2, as in Huffman.
So the question is how you quantize/represent the approximated probabilities
and how large the blocks you use are - ANS gives a much larger parameter space
to optimize over than Huffman does, and I don't believe Huffman sits at the
global optimum.
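As a sketch of what finer-than-powers-of-two quantization looks like: round the real probabilities to integer frequencies summing to the ANS table size, then measure the coding cost of the quantized table. The table size and distribution below are arbitrary choices for illustration.

```python
import math

def quantize_probs(probs, table_size):
    """Round probabilities to integer frequencies summing to table_size,
    keeping every symbol's frequency >= 1 (largest-remainder rounding)."""
    scaled = [p * table_size for p in probs]
    freqs = [max(1, int(s)) for s in scaled]
    # Hand the remaining slots to the symbols with the largest remainders.
    leftover = table_size - sum(freqs)
    by_remainder = sorted(range(len(probs)),
                          key=lambda i: scaled[i] - int(scaled[i]),
                          reverse=True)
    for i in by_remainder[:leftover]:
        freqs[i] += 1
    return freqs

probs = [0.7, 0.2, 0.06, 0.04]      # hypothetical distribution
L = 16                              # deliberately tiny table for illustration
freqs = quantize_probs(probs, L)
# Coding cost when symbol s effectively gets probability freqs[s] / L.
cost = sum(p * -math.log2(f / L) for p, f in zip(probs, freqs))
entropy = -sum(p * math.log2(p) for p in probs)
print(freqs, f"cost {cost:.4f} vs entropy {entropy:.4f} bits/symbol")
```

Even with only 16 states, the quantized cost lands within a fraction of a percent of the entropy here, whereas rounding to powers of two (the Huffman constraint) would be much coarser.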
Thanks,
Jarek
Original comment by duda...@gmail.com
on 29 Jan 2014 at 10:52
The statement that the speed of ANS is hampered by table construction is
unexpected.
> When using bigger table sizes for ANS, the decoder gets even slower, because
the cost of initializing the decoding state tables quickly starts to dominate.
Since it's backed by profiling tools, I'll be cautious in my statement.
> A profile of the decoder on the web pages corpus shows that when using 4096
state ANS, 59% of the total decoding time is spent in initializing the state
table.
However, it can be interesting to note that, in my implementation, table
construction is only a small fraction of overall decoding time. There are more
important contributors.
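For reference, the table initialization being discussed is essentially one pass over the whole state table. The sketch below mimics the FSE-style symbol spread (the stride formula mirrors FSE's, but this is illustrative, not Brotli's or FSE's actual code); the point is that the work grows linearly with table size.

```python
def spread_symbols(freqs, table_log):
    """Spread each symbol over the state table with a stride co-prime with
    the table size (FSE-style). The loop writes every state exactly once,
    so initialization cost is O(table_size) - which is why very large
    tables can start to dominate decode time."""
    size = 1 << table_log
    step = (size >> 1) + (size >> 3) + 3   # odd for size >= 16, hence co-prime
    table = [None] * size
    pos = 0
    for sym, freq in enumerate(freqs):
        for _ in range(freq):
            table[pos] = sym
            pos = (pos + step) % size
    return table

freqs = [700, 200, 84, 40]        # hypothetical frequencies, must sum to 1024
table = spread_symbols(freqs, 10) # 1024-state table, filled in one O(size) pass
```

If a decoder must rebuild many such tables per block, that per-table O(size) cost multiplies, which is consistent with the memory question raised above.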
Speaking about other potential contributors, I was wondering about the total
memory cost required by ANS within Brotli. Even if one table has only 1024
states, I'm wondering what's the total number of tables, especially if they are
required simultaneously (something I'm not able to guess from the source for
the time being).
Original comment by yann.col...@gmail.com
on 30 Jan 2014 at 10:22
Regarding header sizes, look at Figure 3 in Yuriy Reznik, "Quantization of
discrete probability distributions" (
http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=424EA0169E33513605294A80B2FF0B30?doi=10.1.1.205.8709&rep=rep1&type=pdf
) - it compares the accuracy of different quantization methods. Huffman looks
terrible there in comparison to a simple accurate quantization - you could
obtain the Huffman rate with headers a few times smaller, and there is still
room for improvement (discussion:
http://encode.ru/threads/1883-Reducing-the-header-optimal-quantization-compression-of-probability-distribution
).
Much better than Huffman, almost matching the simple accurate quantization, is
Gilbert-Moore (
http://alcatel-lucent.com/bstj/vol38-1959/articles/bstj38-4-933.pdf ) - also a
prefix tree, but this time maintaining the order of symbols, which makes it
cheaper to store.
Regarding speed, Yann's implementation has recently received an almost 2x
speedup - making it almost 2x faster than zlib's Huffman:
https://github.com/Cyan4973/FiniteStateEntropy
Thanks to Fabian Giesen, another variant (rANS) can also obtain similar speed -
it needs less memory and is a bit cheaper to initialize, so it might be more
appropriate for you: https://github.com/rygorous/ryg_rans
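A minimal sketch of the rANS state update referred to here, using arbitrary-precision Python integers in place of the renormalized fixed-width state a production coder would use (the frequency table below is hypothetical):

```python
def rans_encode(symbols, freqs, total):
    """Push symbols onto an unbounded-integer rANS state (last symbol first,
    so the decoder pops them back in forward order)."""
    cum = [0]
    for f in freqs:
        cum.append(cum[-1] + f)
    x = 1
    for s in reversed(symbols):
        x = (x // freqs[s]) * total + cum[s] + x % freqs[s]
    return x

def rans_decode(x, count, freqs, total):
    """Pop `count` symbols back off the rANS state."""
    cum = [0]
    for f in freqs:
        cum.append(cum[-1] + f)
    out = []
    for _ in range(count):
        slot = x % total
        s = next(i for i in range(len(freqs)) if cum[i] <= slot < cum[i + 1])
        x = freqs[s] * (x // total) + slot - cum[s]
        out.append(s)
    return out

freqs = [700, 200, 84, 40]          # hypothetical frequencies, sum to 1024
msg = [0, 1, 0, 2, 0, 3, 1, 0]
state = rans_encode(msg, freqs, 1024)
decoded = rans_decode(state, len(msg), freqs, 1024)
```

Note there is no per-table spread step at all: decoding only needs the cumulative frequencies, which is the "cheaper initialization" advantage mentioned above.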
With best regards,
Jarek
Original comment by duda...@gmail.com
on 28 Feb 2014 at 3:28
Here are current benchmarks of entropy coders (FSE is tANS):
http://encode.ru/threads/1920-In-memory-open-source-benchmark-of-entropy-coders
http://encode.ru/threads/1890-Benchmarking-Entropy-Coders
Another compressor has recently replaced Huffman with ANS, improving both
compression ratio and speed:
http://encode.ru/threads/2017-LzTurbo-The-fastest-now-more-faster
Also, a "tuned" symbol spread for tANS was recently introduced - it uses not
only the quantized probabilities but also the real ones to get better
compression rates: https://github.com/JarekDuda/AsymmetricNumeralSystemsToolkit
Best,
Jarek
Original comment by duda...@gmail.com
on 8 Aug 2014 at 5:45
Original issue reported on code.google.com by
jrmui...@gmail.com
on 3 Jan 2014 at 2:21