microsoft / LLMLingua

[EMNLP'23, ACL'24] To speed up LLM inference and enhance the LLM's perception of key information, compress the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License

Calculation of threshold in LLMLingua-2 #194

Open cornzz opened 2 weeks ago

cornzz commented 2 weeks ago

I was wondering: why is the new_token_probs array constructed for the calculation of the threshold, instead of using word_probs directly?

https://github.com/microsoft/LLMLingua/blob/2dbdbd37aef3b4346c2feec0ff8fba7dc3d42171/llmlingua/prompt_compressor.py#L2398-L2404
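
For context, this is my reading of the linked lines, paraphrased into a standalone sketch (the function name and signature are mine, not the source's; the percentile is derived from the target compression rate):

```python
import numpy as np

def estimate_threshold(words, word_probs, oai_tokenizer, percentile):
    """Hypothetical paraphrase of the linked logic, not the exact source:
    repeat each word's probability once per OpenAI token of that word,
    then take a percentile of the expanded array as the keep/drop threshold."""
    new_token_probs = []
    for word, prob in zip(words, word_probs):
        num_oai_tokens = len(oai_tokenizer.encode(word))  # e.g. 3 for '▁The'
        new_token_probs.extend([prob] * num_oai_tokens)
    return np.percentile(new_token_probs, percentile)
```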

This way, each token of the compressor model is re-tokenized with the OpenAI tokenizer, and the word probability is repeated for each token the OpenAI tokenizer returns. The tokens of the compressor model look like this (after merging sub-word tokens):

['▁The', '▁report', '▁of', '▁the', '▁Civil', '▁Rights', ',', '▁Utilities', ',', '▁Economic', '▁Development', '▁and', '▁Arts', '▁Committee', ...]

Each word is prefixed with the special ▁ character. The OpenAI tokenizer encodes ▁The into 3 ids, because ▁ itself takes up 2 ids.

```python
>>> self.oai_tokenizer.encode('▁The')
[10634, 223, 791]
```

So effectively the word probability is repeated 3 times for each word.

Why is that?

I wonder how this affects the distribution and therefore the threshold, as the probabilities for all words are repeated at least 2 extra times (longer words are split into even more tokens, e.g. nondiscriminatory is 5 tokens), while the probabilities for punctuation characters aren't repeated, as they are not prefixed with ▁.
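
To make the concern concrete, here is a toy sketch with made-up probabilities, assuming the threshold is a percentile over the probability array as in the linked code:

```python
import numpy as np

# hypothetical keep-probabilities from the compressor (values made up)
probs = {"▁The": 0.9, "▁report": 0.4, ",": 0.1, "▁of": 0.3}

# word-level distribution: one probability per word / punctuation mark
word_probs = list(probs.values())

# token-level distribution as in new_token_probs: every '▁'-prefixed word
# contributes (at least) 3 copies, punctuation only 1
new_token_probs = []
for word, p in probs.items():
    n = 3 if word.startswith("▁") else 1  # stand-in for len(oai_tokenizer.encode(word))
    new_token_probs.extend([p] * n)

print(np.percentile(word_probs, 50))       # word-level median
print(np.percentile(new_token_probs, 50))  # token-level median, pulled toward the word probabilities
```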

@pzs19

pzs19 commented 2 weeks ago

Good question, and that's exactly what we are aiming for. The purpose of using the OpenAI tokenizer is to align the specified compression rate with the actual token consumption when using GPT. For example, if we have the words ["Learn", "about", "Tooooooooooooooookenizer"] and assume a compression rate of 66%, then without repeating the word probability the compressor might drop the word "Tooooooooooooooookenizer", resulting in a token-level compression rate of 2/9 (since "Tooooooooooooooookenizer" alone consists of 7 tokens), which does not match the 66%. I hope this helps!
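
In numbers (assuming "Tooooooooooooooookenizer" encodes to 7 OpenAI ids and the other two words to 1 each):

```python
# assumed OpenAI token counts for the example words
token_counts = {"Learn": 1, "about": 1, "Tooooooooooooooookenizer": 7}
total = sum(token_counts.values())          # 9 tokens overall

# word-level view: dropping 1 of 3 words looks like keeping ~66%
print(2 / 3)                                # 0.666...

# but if the dropped word is the long one, GPT actually sees only 2 of 9 tokens
kept = token_counts["Learn"] + token_counts["about"]
print(kept / total)                         # 0.222..., far below the target

# repeating each word's probability per OpenAI token makes the threshold operate
# on all 9 token-level entries, so the realised token rate tracks the target
```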

cornzz commented 2 weeks ago

@pzs19 Thank you for the response! I understood the need to repeat the probability for each OpenAI token. However, I am still not sure whether including the ▁ character for each word skews the distribution, as that adds 2 tokens for each word, but not for punctuation characters?