google-deepmind / language_modeling_is_compression

Llama2 tokenisation and implementation #13

Closed francisrmatt closed 1 month ago

francisrmatt commented 6 months ago

Hi,

I am particularly interested in compressing audio signals, and I note that in the paper's implementation the 16-bit samples are reduced to 8-bit values and then compressed using an ASCII tokenisation. I am curious how you achieved compression using Llama2, given that Llama2's vocabulary is much larger than the proposed ASCII tokenisation.
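For reference, this is a minimal sketch of the preprocessing as I understand it (assuming the 16-bit samples are simply reduced to their most significant byte and each resulting byte is then treated as one symbol; the function name is my own, not from the repo):

```python
import numpy as np

def audio_to_bytes(samples_16bit: np.ndarray) -> bytes:
    """Reduce 16-bit PCM samples to 8-bit values and view them as raw bytes.

    Assumes the reduction simply keeps the most significant byte of each
    sample, i.e. maps int16 in [-32768, 32767] to uint8 in [0, 255].
    """
    as_uint8 = ((samples_16bit.astype(np.int32) + 32768) >> 8).astype(np.uint8)
    return as_uint8.tobytes()  # each byte becomes one symbol for the compressor

# Example: one 2048-sample chunk of a 440 Hz sine wave at 16 kHz.
t = np.arange(2048)
chunk = (np.sin(2 * np.pi * 440 * t / 16000) * 32767).astype(np.int16)
print(len(audio_to_bytes(chunk)))  # 2048 bytes, one per sample
```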

I am also interested in exactly how you called the different language models, whether that was through local downloads or through Hugging Face, and whether you could offer any guidance on this.

Thanks Matt

anianruoss commented 5 months ago

Thank you for your interest in our paper, Matt!

We just feed the ASCII data to the models. Given that the authors of the paper are affiliated with both Google DeepMind and Meta AI, we ran the experiments on our internal setups.
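Outside of our internal setup, feeding the ASCII data to a model would look roughly like the following with the public Llama 2 tokenizer on Hugging Face (illustrative only, not the code we actually ran; the checkpoint is gated, so this assumes you have access):

```python
from transformers import AutoTokenizer

# Public checkpoint name assumed here; access to the meta-llama repos is gated.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

ascii_chunk = bytes(range(32, 96)).decode("ascii")  # stand-in for one data chunk
input_ids = tokenizer(ascii_chunk, return_tensors="pt").input_ids
print(input_ids.shape)  # the tokenizer decides how the ASCII string is split into tokens
```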

francisrmatt commented 5 months ago

Hi Anian, thanks for your reply. I can understand feeding in ASCII, but since the output is a probability distribution over ALL of Llama2's tokens, how do you transform those outputs into a distribution over just the ASCII symbols? That is to say, if the arithmetic coding is done over the distribution for individual ASCII characters, how do you deal with the probability distribution returned by Llama, which includes many thousands of tokens beyond just ASCII?
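To make the question concrete, this is the kind of thing I imagined one might do, purely a guess on my part using the public Hugging Face checkpoint, so it may be nothing like your internal implementation: restrict the next-token logits to Llama's byte-fallback tokens and renormalise.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-hf"  # assumed public (gated) checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Llama's SentencePiece vocabulary contains byte-fallback pieces named like
# "<0x41>"; if the naming differs, this lookup would need adjusting.
byte_token_ids = torch.tensor(
    [tokenizer.convert_tokens_to_ids(f"<0x{b:02X}>") for b in range(256)]
)

@torch.no_grad()
def next_byte_distribution(prefix_ids: torch.Tensor) -> torch.Tensor:
    """Restrict the next-token distribution to 256 byte symbols and renormalise.

    prefix_ids: tensor of shape (1, sequence_length) with the token ids seen so far.
    Returns a length-256 tensor that could feed an arithmetic coder.
    """
    logits = model(prefix_ids).logits[0, -1]   # logits over the full vocabulary
    byte_logits = logits[byte_token_ids]       # keep only the 256 byte tokens
    return torch.softmax(byte_logits, dim=-1)  # renormalised distribution
```

Is that roughly the idea, or do you handle the mapping from the full vocabulary to ASCII in some other way?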