cwallenwein opened this issue 8 months ago
The tokens from index 3 to 258 are not ASCII characters but tokens used for byte fallback. There are 140k+ Unicode characters, but Llama's vocab size is only 32k, so rare Unicode characters are represented as a sequence of UTF-8-encoded bytes.
```python
tokenizer.tokenize("༃")
# ['▁', '<0xE0>', '<0xBC>', '<0x83>']
```
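For reference, here is a minimal sketch of how to reproduce this end to end (the checkpoint name is an assumption; any Llama-family tokenizer with byte fallback should behave the same way):

```python
from transformers import AutoTokenizer

# Checkpoint name is an assumption, not confirmed by this issue;
# any Llama-family tokenizer with byte fallback should behave the same.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

ids = tokenizer("༃", add_special_tokens=False)["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))  # ['▁', '<0xE0>', '<0xBC>', '<0x83>']
print(tokenizer.decode(ids))                 # should round-trip back to '༃'
```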
According to this website, the character `༃` is encoded as `0xE0 0xBC 0x83` in UTF-8.
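The byte values can be double-checked with the Python standard library alone:

```python
# Pure-stdlib check of the UTF-8 encoding of "༃".
print(["0x%02X" % b for b in "༃".encode("utf-8")])
# ['0xE0', '0xBC', '0x83'] — matching the three byte-fallback tokens above
```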
Why are byte tokens for all ASCII characters in the tokenizer file as well?
For example, ASCII 0x31 is the character 1, and both tokens exist in the vocab: `"<0x31>": 52` and `"1": 29896`.
If the tokens represent the same character, why keep them twice? Although these are only 256 tokens, they still increase the size of the embedding layer.
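To make the duplication concrete, here is a sketch that looks up both IDs (checkpoint name is again an assumption; the IDs are the ones quoted from the vocab above):

```python
from transformers import AutoTokenizer

# Checkpoint name is an assumption; the expected IDs are the ones
# quoted from the vocab file above.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Two distinct embedding rows that can both stand for the character "1":
print(tokenizer.convert_tokens_to_ids("<0x31>"))  # 52    (byte token: 3 + 0x31)
print(tokenizer.convert_tokens_to_ids("1"))       # 29896 (regular token)
```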