josharian opened 1 month ago
I was playing with the tokenizer, and I noticed some missed merge opportunities.
```python
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer(['\t', '\t\t', '-\t\t', '\t\t-'])
{'input_ids': [[128000, 197], [128000, 298], [128000, 12, 298], [128000, 197, 197, 12]], 'attention_mask': [[1, 1], [1, 1], [1, 1, 1], [1, 1, 1, 1]]}
```
Observe: `\t\t` merges into a single token (298), and `-\t\t` uses that merge, but `\t\t-` does not: it tokenizes as `197, 197, 12` (`\t`, `\t`, `-`) instead of `298, 12`.
This is probably a consequence of how the regex splits, and thus in some sense not a bug... but it is somewhat unfortunate. The sequence `\t\t}` exhibits the same behavior, and is very common in Go code, so there are lots of missed merges.
(This also reproduces using tiktoken.)
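The split can be seen without loading any tokenizer. Below is a minimal, stdlib-only sketch of the relevant branches of a cl100k-style pre-tokenization pattern (the real pattern in tiktoken uses `\p{L}`/`\p{N}` classes via the third-party `regex` module, and Llama 3's pattern differs in details; this simplification keeps only the whitespace and punctuation branches) showing why the tabs in `\t\t-` can never merge:

```python
import re

# Stdlib-only simplification of the whitespace/punctuation branches of a
# cl100k-style pre-tokenization regex (not the exact Llama 3 pattern):
#   " ?[^\s\w]+[\r\n]*"  -- punctuation runs, optionally space-prefixed
#   "\s+(?!\S)"          -- whitespace NOT followed by non-whitespace
#   "\s+"                -- any remaining whitespace
PAT = r" ?[^\s\w]+[\r\n]*|\s+(?!\S)|\s+"

# Trailing tabs stay together: \s+(?!\S) matches both tabs as one piece,
# so BPE is free to merge "\t\t" into a single token.
print(re.findall(PAT, "-\t\t"))   # ['-', '\t\t']

# Tabs followed by non-whitespace get torn apart: the lookahead forces
# the engine to backtrack, leaving the last tab in its own piece, so the
# "\t\t" merge can never fire.
print(re.findall(PAT, "\t\t-"))   # ['\t', '\t', '-']
```

Because BPE merges only happen within a pre-tokenization piece, the `\t\t` merge is unreachable whenever the tabs directly precede a non-whitespace character such as `}`.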