cwallenwein opened this issue 8 months ago
The tokens from index 3 to 258 are not ASCII characters but tokens used for byte fallback. There are 140k+ Unicode characters, but Llama's vocab size is only 32k, so rare Unicode characters are represented as a sequence of UTF-8-encoded bytes.
```python
tokenizer.tokenize("༃")
# ['▁', '<0xE0>', '<0xBC>', '<0x83>']
```
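For reference, here is a minimal sketch of how to reproduce this end to end (the checkpoint name is an assumption; any Llama-family tokenizer with byte fallback should behave the same way):

```python
from transformers import AutoTokenizer

# Checkpoint name is an assumption, not confirmed by this issue;
# any Llama-family tokenizer with byte fallback should behave the same.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

ids = tokenizer("༃", add_special_tokens=False)["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))  # ['▁', '<0xE0>', '<0xBC>', '<0x83>']
print(tokenizer.decode(ids))                 # should round-trip back to '༃'
```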
According to this website, the character `༃` is encoded as `0xE0 0xBC 0x83` in UTF-8.
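The byte values can be double-checked with the Python standard library alone:

```python
# Pure-stdlib check of the UTF-8 encoding of "༃".
print(["0x%02X" % b for b in "༃".encode("utf-8")])
# ['0xE0', '0xBC', '0x83'] — matching the three byte-fallback tokens above
```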
Why are byte tokens for all ASCII characters in the tokenizer file as well?
For example, ASCII 0x31 is the character 1, and both tokens exist in the vocab: `"<0x31>": 52` and `"1": 29896`.
If the tokens represent the same character, why keep them twice? Although these are only 256 tokens, they still increase the size of the embedding layer.
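To make the duplication concrete, here is a sketch that looks up both IDs (checkpoint name is again an assumption; the IDs are the ones quoted from the vocab above):

```python
from transformers import AutoTokenizer

# Checkpoint name is an assumption; the expected IDs are the ones
# quoted from the vocab file above.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Two distinct embedding rows that can both stand for the character "1":
print(tokenizer.convert_tokens_to_ids("<0x31>"))  # 52    (byte token: 3 + 0x31)
print(tokenizer.convert_tokens_to_ids("1"))       # 29896 (regular token)
```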