meta-llama / llama

Inference code for Llama models

Why are ASCII chars in tokenizer? #1000

Open cwallenwein opened 8 months ago

cwallenwein commented 8 months ago

Why are all ASCII characters in the tokenizer file?

"<0x00>": 3,
"<0x01>": 4,
"<0x02>": 5,
"<0x03>": 6,
 ...
"<0xFF>": 258,

For example, ASCII 0x31 is the character "1", and both tokens exist in the vocab: "<0x31>": 52 and "1": 29896.

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("/output/path")
tokenizer.decode(52) == tokenizer.decode(29896)
# True

If the tokens represent the same char, why keep them twice? Although these are just 256 tokens, the embedding layer still increases in size.
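
For a rough sense of scale, here is a back-of-the-envelope sketch of what those 256 extra tokens cost, assuming the 7B model's embedding dimension of 4096 (larger variants use bigger dimensions):

# Rough parameter cost of the 256 byte-fallback tokens, assuming an
# embedding dimension of 4096 (Llama-7B); the same cost appears again
# in the output projection.
num_byte_tokens = 256
hidden_size = 4096  # assumed; varies by model size
extra_params = num_byte_tokens * hidden_size
print(extra_params)  # 1048576 -> about 1M parameters per embedding matrix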

cwallenwein commented 7 months ago

The tokens at indices 3 to 258 are not ASCII characters but byte-fallback tokens. Unicode has 140k+ characters while Llama's vocab size is only 32k, so rare Unicode characters are represented as a sequence of UTF-8-encoded bytes.

tokenizer.tokenize("༃")
['▁', '<0xE0>', '<0xBC>', '<0x83>']

This matches the character's UTF-8 encoding: ༃ (U+0F03) is the byte sequence 0xE0 0xBC 0x83.
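
A quick way to verify this without the tokenizer is to encode the character with Python's built-in UTF-8 codec; the bytes should line up with the <0x..> fallback tokens above:

# Minimal check: the character's UTF-8 bytes should match the <0x..>
# byte-fallback tokens produced by tokenizer.tokenize("༃").
char = "༃"
print([f"<0x{b:02X}>" for b in char.encode("utf-8")])
# ['<0xE0>', '<0xBC>', '<0x83>']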