meta-llama / llama3

The official Meta Llama 3 GitHub site

missed double-tab merge opportunities in the tokenizer #227

Open josharian opened 1 month ago

josharian commented 1 month ago

I was playing with the tokenizer, and I noticed some missed merge opportunities.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer(['\t', '\t\t', '-\t\t', '\t\t-'])
{'input_ids': [[128000, 197], [128000, 298], [128000, 12, 298], [128000, 197, 197, 12]], 'attention_mask': [[1, 1], [1, 1], [1, 1, 1], [1, 1, 1, 1]]}

Observe:

- '\t' is a single token (197), and '\t\t' is merged into a single token (298).
- '-\t\t' keeps that merge: [12, 298].
- '\t\t-' does not: the two tabs come out as separate tokens, [197, 197, 12].

This is probably a consequence of how the pre-tokenization regex splits the text, and thus in some sense not a bug...but it is somewhat unfortunate. The sequence \t\t} exhibits the same behavior and is very common in Go code, so there are lots of missed merges.
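
A minimal sketch of why this happens, assuming the cl100k-style pre-tokenization pattern that llama/tokenizer.py passes to tiktoken (the pattern string below is reproduced from memory and may not match the repo byte-for-byte): the regex splits the text into pieces before BPE runs, and merges never cross piece boundaries.

```python
# Sketch: show how the pre-tokenization regex splits each string.
# PAT is assumed to be the cl100k-style pat_str used by the Llama 3 tokenizer;
# verify against llama/tokenizer.py before relying on the exact string.
import regex  # third-party 'regex' module, needed for \p{L} / \p{N}

PAT = (
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"
    r"|[^\r\n\p{L}\p{N}]?\p{L}+"
    r"|\p{N}{1,3}"
    r"| ?[^\s\p{L}\p{N}]+[\r\n]*"
    r"|\s*[\r\n]+"
    r"|\s+(?!\S)"   # whitespace NOT followed by a non-whitespace character
    r"|\s+"
)

for text in ["\t\t", "-\t\t", "\t\t-"]:
    print(repr(text), regex.findall(PAT, text))

# '\t\t'   -> ['\t\t']           one piece, BPE merges it to 298
# '-\t\t'  -> ['-', '\t\t']      trailing tabs stay in one piece, still 298
# '\t\t-'  -> ['\t', '\t', '-']  \s+(?!\S) peels off only the first tab,
#                                so BPE sees two separate '\t' pieces (197, 197)
```

In other words, the merge is not blocked by the vocabulary (token 298 exists) but by the split: once the two tabs land in different pre-tokenization pieces, BPE can never recombine them.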

josharian commented 1 month ago

(This also reproduces using tiktoken.)
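
For reference, a hypothetical way to reproduce with this repo's tiktoken-based reference tokenizer (the class name and encode signature are assumed from llama/tokenizer.py, and the model path is a placeholder):

```python
# Hypothetical reproduction with the reference tokenizer in this repo,
# which wraps tiktoken. Adjust the path to wherever tokenizer.model lives.
from llama.tokenizer import Tokenizer

tok = Tokenizer(model_path="Meta-Llama-3-8B/tokenizer.model")  # placeholder path
for text in ["\t", "\t\t", "-\t\t", "\t\t-"]:
    print(repr(text), tok.encode(text, bos=False, eos=False))
# Expected to show the same pattern as the transformers output above:
# '\t\t-' encodes the tabs as two single-tab tokens rather than the merged pair.
```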