speechmatics / hqa

Code to accompany the paper "Hierarchical Quantized Autoencoders"
MIT License

understanding compression rate #10

Closed · amarzullo24 closed 2 years ago

amarzullo24 commented 2 years ago

Hi! I have a question about how the quantized representation can be compressed in practice. I understand that entropy is used as an estimate of the size the compressed file should have, but how can such a file actually be obtained at inference time? I have seen related work use, for example, lossless entropy coding algorithms or Huffman coding, among others.

Thanks for your help!

jplhughes commented 2 years ago

Hi @emmeduz, thanks for your question. Firstly, entropy is the theoretical maximum compression rate we can achieve with lossless compression for a given probability distribution. Lossless means we can reconstruct the signal exactly at the receiving end after the compressed bits have been transmitted. HQA is a lossy compression scheme, which is why our reconstructions are often very different from the original; however, the compression rates we achieve with our scheme are far more extreme than the bound given by the entropy.

We calculate the compression rate by comparing the total number of bits in the image to the total number of bits needed for the codes that cover the latent space at the final layer. Also remember that if we were to transmit the codes given by HQA, we would transmit only the codebook ids and not the codebook embeddings (we assume there is a codebook and decoder at the receiving end of the transmission, so the signal can be reconstructed there). Does that answer your question on how the quantised representation is compressed in practice?
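As a back-of-the-envelope illustration of that calculation (the image size, codebook size, and latent grid below are made-up numbers chosen to make the arithmetic concrete, not figures from the paper):

```python
import math

# Hypothetical figures for illustration only.
image_bits = 32 * 32 * 3 * 8                    # 32x32 RGB image, 8 bits per channel
codebook_size = 256                             # assumed codebook size at the final layer
bits_per_code = int(math.log2(codebook_size))   # 8 bits per codebook id
num_codes = 4 * 4                               # assumed 4x4 grid of ids in the final latent
code_bits = num_codes * bits_per_code           # 128 bits to transmit

print(f"compression rate: {image_bits / code_bits:.0f}x")  # -> 192x
```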

amarzullo24 commented 2 years ago

Thanks for your answer! To summarise: imagine we want to transmit the image to a remote site (assuming both sender and receiver know the codebook). A possible pipeline would be:

  1. encode the image and compute the codebook ids
  2. compress the codebook ids (using whatever lossless compression algorithm, e.g. gzip) and send the compressed ids to the remote site
  3. at remote site, use the ids to query the codebook and reconstruct the image

Therefore, the "actual" size of the compressed image to be transmitted would be the size of the zip file containing the ids (see the sketch below). Does that make sense?
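A rough sketch of that pipeline might look like the following (`model.encode`, `model.decode`, and `send` are hypothetical stand-ins, not this repo's API; gzip stands in for any lossless coder):

```python
import gzip

import numpy as np

# --- sender side ---
ids = model.encode(image)                                 # hypothetical: image -> codebook ids
payload = gzip.compress(ids.astype(np.uint8).tobytes())   # assumes codebook_size <= 256
send(payload)                                             # hypothetical transport

# --- receiver side ---
ids = np.frombuffer(gzip.decompress(payload), dtype=np.uint8)
reconstruction = model.decode(ids)                        # hypothetical: ids -> image
```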

jplhughes commented 2 years ago

Yes, exactly, you can think of it like that. However, we don't compress the ids with something like gzip. You can work out the exact number of bits you need to represent each codebook id, i.e. log2(codebook_size). From that point you are just transmitting 0s and 1s, which, AFAIK, you can't compress further with gzip. Then, at the decoder side, you convert your stream of 0s and 1s back to ids, convert the ids to codebook embeddings, and use the HQA decoder to reconstruct the image.
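A toy illustration of that fixed-width id stream (a Python string of '0'/'1' characters stands in for a real byte-packed bitstream; this assumes codebook_size is a power of two):

```python
import math

def pack_ids(ids, codebook_size):
    bits = int(math.log2(codebook_size))   # exact number of bits needed per id
    return "".join(format(i, f"0{bits}b") for i in ids)

def unpack_ids(stream, codebook_size):
    bits = int(math.log2(codebook_size))
    return [int(stream[i:i + bits], 2) for i in range(0, len(stream), bits)]

ids = [3, 255, 17, 0]                      # example codebook ids
stream = pack_ids(ids, codebook_size=256)  # 4 ids x 8 bits = 32 bits
assert unpack_ids(stream, codebook_size=256) == ids
```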

amarzullo24 commented 2 years ago

That was helpful, thank you!