Thank you for this interesting suggestion.
We already computed the sum of all the fractional bits lost to Huffman
coding, and it was 0.8% of the total compressed corpus size. We believe this
is the maximum that we can gain by simply switching from Huffman to
arithmetic coding.
We will study this possibility and report back with our findings.
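The fractional-bit loss mentioned above can be illustrated with a small sketch: build a Huffman code, then compare its expected code length against the Shannon entropy. The distribution below is made up for illustration, not Brotli's measured statistics.

```python
import heapq
import math

def huffman_code_lengths(probs):
    """Return the Huffman code length of each symbol for a probability list."""
    # Heap entries: (probability, tie-breaker, [(symbol, depth), ...]).
    heap = [(p, i, [(i, 0)]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    tick = len(probs)
    while len(heap) > 1:
        pa, _, a = heapq.heappop(heap)
        pb, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        heapq.heappush(heap, (pa + pb, tick, [(s, d + 1) for s, d in a + b]))
        tick += 1
    return dict(heap[0][2])

# Hypothetical skewed distribution, not Brotli's actual one.
probs = [0.55, 0.25, 0.10, 0.05, 0.05]
lengths = huffman_code_lengths(probs)
avg = sum(p * lengths[i] for i, p in enumerate(probs))   # Huffman bits/symbol
entropy = -sum(p * math.log2(p) for p in probs)          # Shannon limit
print(f"Huffman {avg:.3f} vs entropy {entropy:.3f} bits/symbol "
      f"({100 * (avg - entropy) / avg:.2f}% fractional-bit loss)")
```

Summing this per-symbol loss over a whole corpus, weighted by symbol counts, gives the kind of 0.8% upper-bound figure described above.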
Original comment by szaba...@google.com
on 6 Jan 2014 at 4:23
We implemented an ANS based entropy coder and used it to replace the Huffman
coder in Brotli.
The compression ratio and decoding speed results are summarized in the attached
study.
Based on that, we concluded that for the specific use-case in Brotli, the
Huffman coder performs better than an ANS based entropy coder.
Original comment by szaba...@google.com
on 29 Jan 2014 at 4:26
Sorry for opening the issue again.
Could you test with relational/tabular data too (like the data your data
processing teams at Google are dealing with)?
In our tests, the probability distributions of relational/tabular data are
often a *lot* better matched to arithmetic encoders than to Huffman-based
entropy encoders.
Original comment by est...@gmail.com
on 29 Jan 2014 at 8:38
And I should add that the probability distributions don't change much across
whole relational data streams, so the compression blocks can remain large
(fewer headers).
(Both this and the previous comment assume that you work with columnar blocks,
like RCFile does.)
Original comment by est...@gmail.com
on 29 Jan 2014 at 8:55
I see the implementation from the document is slower than Huffman coding - this
can probably be fixed, as both implementations I am aware of are faster than or
comparable to Huffman: the linked one by Yann Collet, and one by Pascal
Massimino (skal) from Google, who presents his results here:
https://groups.google.com/a/webmproject.org/forum/#!topic/codec-devel/idezdUoV1yY
Regarding the sizes of headers containing probability distributions: in Yann's
implementation these probabilities are found adaptively from already processed
data - there is no cost of storing them.
If you indeed need to store them, there is a question of how well we want to
approximate/quantize them - in ANS we could also use approximation by powers of
2, as in Huffman.
So the question is how you quantize/represent the approximated probabilities
and how large the blocks you use are - ANS gives a much larger parameter space
to optimize over than Huffman does, and I don't believe Huffman sits at the
global optimum.
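As a sketch of what finer-than-powers-of-two quantization looks like: round the real probabilities to integer frequencies summing to the ANS table size, then measure the coding cost of the quantized table. The table size and distribution below are arbitrary choices for illustration.

```python
import math

def quantize_probs(probs, table_size):
    """Round probabilities to integer frequencies summing to table_size,
    keeping every symbol's frequency >= 1 (largest-remainder rounding)."""
    scaled = [p * table_size for p in probs]
    freqs = [max(1, int(s)) for s in scaled]
    # Hand the remaining slots to the symbols with the largest remainders.
    leftover = table_size - sum(freqs)
    by_remainder = sorted(range(len(probs)),
                          key=lambda i: scaled[i] - int(scaled[i]),
                          reverse=True)
    for i in by_remainder[:leftover]:
        freqs[i] += 1
    return freqs

probs = [0.7, 0.2, 0.06, 0.04]      # hypothetical distribution
L = 16                              # deliberately tiny table for illustration
freqs = quantize_probs(probs, L)
# Coding cost when symbol s effectively gets probability freqs[s] / L.
cost = sum(p * -math.log2(f / L) for p, f in zip(probs, freqs))
entropy = -sum(p * math.log2(p) for p in probs)
print(freqs, f"cost {cost:.4f} vs entropy {entropy:.4f} bits/symbol")
```

Even with only 16 states, the quantized cost lands within a fraction of a percent of the entropy here, whereas rounding to powers of two (the Huffman constraint) would be much coarser.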
Thanks,
Jarek
Original comment by duda...@gmail.com
on 29 Jan 2014 at 10:52
The statement that the speed of ANS is hampered by table construction is
unexpected.
> When using bigger table sizes for ANS, the decoder gets even slower, because
the cost of initializing the decoding state tables quickly starts to dominate.
Since it's backed by profiling tools, I'll be cautious in my statement.
> A profile of the decoder on the web pages corpus shows that when using 4096
state ANS, 59% of the total decoding time is spent in initializing the state
table.
However, it can be interesting to note that, in my implementation, table
construction is only a small fraction of overall decoding time. There are more
important contributors.
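For reference, the table initialization being discussed is essentially one pass over the whole state table. The sketch below mimics the FSE-style symbol spread (the stride formula mirrors FSE's, but this is illustrative, not Brotli's or FSE's actual code); the point is that the work grows linearly with table size.

```python
def spread_symbols(freqs, table_log):
    """Spread each symbol over the state table with a stride co-prime with
    the table size (FSE-style). The loop writes every state exactly once,
    so initialization cost is O(table_size) - which is why very large
    tables can start to dominate decode time."""
    size = 1 << table_log
    step = (size >> 1) + (size >> 3) + 3   # odd for size >= 16, hence co-prime
    table = [None] * size
    pos = 0
    for sym, freq in enumerate(freqs):
        for _ in range(freq):
            table[pos] = sym
            pos = (pos + step) % size
    return table

freqs = [700, 200, 84, 40]        # hypothetical frequencies, must sum to 1024
table = spread_symbols(freqs, 10) # 1024-state table, filled in one O(size) pass
```

If a decoder must rebuild many such tables per block, that per-table O(size) cost multiplies, which is consistent with the memory question raised above.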
Speaking about other potential contributors, I was wondering about the total
memory cost required by ANS within Brotli. Even if one table has only 1024
states, I'm wondering what's the total number of tables, especially if they are
required simultaneously (something I'm not able to guess from the source for
the time being).
Original comment by yann.col...@gmail.com
on 30 Jan 2014 at 10:22
Regarding header sizes, look at Figure 3 in Yuriy Reznik, "Quantization of
discrete probability distributions" (
http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=424EA0169E33513605294A80B2FF0B30?doi=10.1.1.205.8709&rep=rep1&type=pdf
) - it compares the accuracy of different quantization methods. Huffman looks
terrible there in comparison to a simple accurate quantization - you could
obtain the Huffman rate with headers a few times smaller, and there is still
room for improvement (discussion:
http://encode.ru/threads/1883-Reducing-the-header-optimal-quantization-compression-of-probability-distribution
).
Much better than Huffman, almost matching the simple accurate quantization, is
Gilbert-Moore (
http://alcatel-lucent.com/bstj/vol38-1959/articles/bstj38-4-933.pdf ) - also a
prefix tree, but this time maintaining the order of symbols, which makes it
cheaper to store.
Regarding speed, Yann's implementation has recently received an almost 2x
speedup - making it almost 2x faster than zlib's Huffman:
https://github.com/Cyan4973/FiniteStateEntropy
Thanks to Fabian Giesen, another variant (rANS) can also obtain similar speed -
it needs less memory and is a bit cheaper to initialize, so it might be more
appropriate for you: https://github.com/rygorous/ryg_rans
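A minimal sketch of the rANS state update referred to here, using arbitrary-precision Python integers in place of the renormalized fixed-width state a production coder would use (the frequency table below is hypothetical):

```python
def rans_encode(symbols, freqs, total):
    """Push symbols onto an unbounded-integer rANS state (last symbol first,
    so the decoder pops them back in forward order)."""
    cum = [0]
    for f in freqs:
        cum.append(cum[-1] + f)
    x = 1
    for s in reversed(symbols):
        x = (x // freqs[s]) * total + cum[s] + x % freqs[s]
    return x

def rans_decode(x, count, freqs, total):
    """Pop `count` symbols back off the rANS state."""
    cum = [0]
    for f in freqs:
        cum.append(cum[-1] + f)
    out = []
    for _ in range(count):
        slot = x % total
        s = next(i for i in range(len(freqs)) if cum[i] <= slot < cum[i + 1])
        x = freqs[s] * (x // total) + slot - cum[s]
        out.append(s)
    return out

freqs = [700, 200, 84, 40]          # hypothetical frequencies, sum to 1024
msg = [0, 1, 0, 2, 0, 3, 1, 0]
state = rans_encode(msg, freqs, 1024)
decoded = rans_decode(state, len(msg), freqs, 1024)
```

Note there is no per-table spread step at all: decoding only needs the cumulative frequencies, which is the "cheaper initialization" advantage mentioned above.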
With best regards,
Jarek
Original comment by duda...@gmail.com
on 28 Feb 2014 at 3:28
Here are current benchmarks of entropy coders (FSE is tANS):
http://encode.ru/threads/1920-In-memory-open-source-benchmark-of-entropy-coders
http://encode.ru/threads/1890-Benchmarking-Entropy-Coders
Another compressor has recently replaced Huffman with ANS, improving both
compression ratio and speed:
http://encode.ru/threads/2017-LzTurbo-The-fastest-now-more-faster
Also, a "tuned" symbol spread for tANS was recently introduced - it uses not
only the quantized probabilities but also the real ones to get better
compression rates: https://github.com/JarekDuda/AsymmetricNumeralSystemsToolkit
Best,
Jarek
Original comment by duda...@gmail.com
on 8 Aug 2014 at 5:45
Original issue reported on code.google.com by
jrmui...@gmail.com
on 3 Jan 2014 at 2:21