meta-llama / llama3

The official Meta Llama 3 GitHub site

missed double-tab merge opportunities in the tokenizer #227

Open josharian opened 1 month ago

josharian commented 1 month ago

I was playing with the tokenizer, and I noticed some missed merge opportunities.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer(['\t', '\t\t', '-\t\t', '\t\t-'])
{'input_ids': [[128000, 197], [128000, 298], [128000, 12, 298], [128000, 197, 197, 12]], 'attention_mask': [[1, 1], [1, 1], [1, 1, 1], [1, 1, 1, 1]]}

Observe:

- '\t' is a single token (197), and '\t\t' is merged into a single token (298).
- '-\t\t' keeps that merge: [12, 298].
- '\t\t-' does not: the two tabs come out as separate tokens, [197, 197, 12].

This is probably a consequence of how the pre-tokenization regex splits the text, and thus in some sense not a bug...but it is somewhat unfortunate. The sequence \t\t} exhibits the same behavior and is very common in Go code, so there are lots of missed merges.
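
A minimal sketch of why this happens, assuming the cl100k-style pre-tokenization pattern that llama/tokenizer.py passes to tiktoken (the pattern string below is reproduced from memory and may not match the repo byte-for-byte): the regex splits the text into pieces before BPE runs, and merges never cross piece boundaries.

```python
# Sketch: show how the pre-tokenization regex splits each string.
# PAT is assumed to be the cl100k-style pat_str used by the Llama 3 tokenizer;
# verify against llama/tokenizer.py before relying on the exact string.
import regex  # third-party 'regex' module, needed for \p{L} / \p{N}

PAT = (
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"
    r"|[^\r\n\p{L}\p{N}]?\p{L}+"
    r"|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*"
    r"|\s*[\r\n]+"
    r"|\s+(?!\S)"   # whitespace NOT followed by a non-whitespace character
    r"|\s+"
)

for text in ["\t\t", "-\t\t", "\t\t-"]:
    print(repr(text), regex.findall(PAT, text))

# '\t\t'   -> ['\t\t']           one piece, BPE merges it to 298
# '-\t\t'  -> ['-', '\t\t']      trailing tabs stay in one piece, still 298
# '\t\t-'  -> ['\t', '\t', '-']  \s+(?!\S) peels off only the first tab,
#                                so BPE sees two separate '\t' pieces (197, 197)
```

In other words, the merge is not blocked by the vocabulary (token 298 exists) but by the split: once the two tabs land in different pre-tokenization pieces, BPE can never recombine them.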

josharian commented 1 month ago

(This also reproduces using tiktoken.)
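
For reference, a hypothetical way to reproduce with this repo's tiktoken-based reference tokenizer (the class name and encode signature are assumed from llama/tokenizer.py, and the model path is a placeholder):

```python
# Hypothetical reproduction with the reference tokenizer in this repo,
# which wraps tiktoken. Adjust the path to wherever tokenizer.model lives.
from llama.tokenizer import Tokenizer

tok = Tokenizer(model_path="Meta-Llama-3-8B/tokenizer.model")  # placeholder path
for text in ["\t", "\t\t", "-\t\t", "\t\t-"]:
    print(repr(text), tok.encode(text, bos=False, eos=False))
# Expected to show the same pattern as the transformers output above:
# '\t\t-' encodes the tabs as two single-tab tokens rather than the merged pair.
```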