cornzz opened this issue 2 weeks ago
Good question, and that's exactly what we are aiming for. The purpose of using the OpenAI tokenizer is to align the specified compression rate with the actual token consumption when using GPT. For example, if we have the words ["Learn", "about", "Tooooooooooooooookenizer"] and assume a compression rate of 66%, then without repeating the word probability the compressor might drop the word "Tooooooooooooooookenizer". That keeps 2 of 3 words, but the token-level compression rate is only 2/9 (since "Tooooooooooooooookenizer" alone consists of 7 tokens), which does not match the 66%. I hope this helps!
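As a quick sanity check of the per-word token counts, something like the following can be used (a minimal sketch, assuming the `tiktoken` package with the `cl100k_base` encoding as a stand-in for the exact target tokenizer):

```python
# Rough per-word GPT token counts (assumes `pip install tiktoken`;
# cl100k_base is used here as a stand-in for the actual target encoding).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
words = ["Learn", "about", "Tooooooooooooooookenizer"]

token_counts = {w: len(enc.encode(w)) for w in words}
print(token_counts)

# If only the long word is dropped, the word-level rate is 2/3, but the
# token-level rate is 2 / sum(token_counts.values()), which is much smaller
# when a single word expands into many tokens.
```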
@pzs19 Thank you for the response! I understood the need to repeat the probability for each OpenAI token. However, I am still not sure whether including the `▁` character for each word skews the distribution, since it adds 2 tokens for every word but none for punctuation characters.
I was also wondering why, for the calculation of the threshold, the `new_token_probs` array is constructed instead of using `word_probs` directly:

https://github.com/microsoft/LLMLingua/blob/2dbdbd37aef3b4346c2feec0ff8fba7dc3d42171/llmlingua/prompt_compressor.py#L2398-L2404
This way, each token of the compressor model is tokenized using the OpenAI tokenizer, and the word probability is repeated for each token the OpenAI tokenizer returns. The tokens of the compressor model (after merging sub-word tokens) look like this: each word is prefixed with a special `▁` character. The OpenAI tokenizer encodes `▁The` into 3 ids, because the special underscore character alone takes up 2 ids, so effectively the word probability is repeated 3 times for each word. Why is that?

I wonder how this affects the distribution and therefore the threshold: the probabilities of all words are repeated an additional 2 times at least (longer words are split into more tokens, e.g. `nondiscriminatory` is 5 tokens), while the probabilities of punctuation characters are not repeated additionally, since they are not prefixed with `▁`.
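To make the concern concrete, here is a toy sketch of what I mean (not the repo's code: the probabilities are made up, `cl100k_base` from `tiktoken` stands in for the actual target tokenizer, and a plain percentile stands in for the actual threshold computation):

```python
# Toy illustration only: hypothetical keep-probabilities, repeated once per
# OpenAI token id ("▁" prefix included for words, not for punctuation),
# compared against thresholding the per-word probabilities directly.
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

words = ["The", "quick", "nondiscriminatory", "fox", ","]
word_probs = [0.9, 0.4, 0.1, 0.7, 0.95]  # made-up values

new_token_probs = []
for word, prob in zip(words, word_probs):
    # Words get the SentencePiece-style "▁" prefix, punctuation does not,
    # mirroring the behaviour described above.
    text = word if word in {",", ".", "!", "?"} else "▁" + word
    n_ids = len(enc.encode(text))
    print(f"{text!r} -> {n_ids} OpenAI token id(s)")
    new_token_probs.extend([prob] * n_ids)

# Keep 50% of the tokens; the threshold is the probability cut-off below
# which tokens would be dropped.
rate = 0.5
print("threshold over word_probs:     ", np.percentile(word_probs, (1 - rate) * 100))
print("threshold over new_token_probs:", np.percentile(new_token_probs, (1 - rate) * 100))
```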
@pzs19