JulienVig opened this issue 2 weeks ago
Hi there 👋 can you reproduce this with the python transformers library?
Hi! I actually get consistent results with the Python `tokenizers` library:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print(tokenizer.encode("\n").ids)  # [1, 29871, 13] same as Xenova/llama-tokenizer

tokenizer = Tokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
print(tokenizer.encode("\n").ids)  # [128000, 198] Xenova/llama-3-tokenizer doesn't include the special token by default
print(tokenizer.encode("\n", add_special_tokens=False).ids)  # [198] same as Xenova/llama-3-tokenizer
```
So both JavaScript and Python yield a tokenization that differs from the playgrounds. Am I comparing different tokenization settings?
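For what it's worth, the same check through the `transformers` `AutoTokenizer` (what the first reply actually asked about, rather than the standalone `tokenizers` library) should look roughly like this — an untested sketch, with the expected ids taken from the values above:

```python
from transformers import AutoTokenizer

# Same strings, but through the transformers AutoTokenizer instead of tokenizers.Tokenizer.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print(tokenizer.encode("\n"))                            # expected [1, 29871, 13], as above

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
print(tokenizer.encode("\n"))                            # expected [128000, 198] (BOS added by default)
print(tokenizer.encode("\n", add_special_tokens=False))  # expected [198]
```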
System Info
TypeScript 5.5.4, transformers.js 3.0.2, Node.js v20.17.0
Environment/Platform

Node.js
Description
Using some pretrained tokenizers doesn't yield the same tokenization of `"\n"` or `" \n"` as Tiktokenizer or Xenova's playground.

For example, `Xenova/llama-3-tokenizer` tokenizes `"\n"` as `[198]` and `" \n"` as `[720]`. In both playgrounds (selecting Llama 3 in Xenova's playground and meta-llama/Meta-Llama-3-8B in Tiktokenizer), the Llama 3 tokenizer should tokenize `"\n"` as `[1734]` and `" \n"` as `[1144, 77]`.

Similarly for Llama 2, `Xenova/llama-tokenizer` tokenizes `"\n"` as `[1, 29871, 13]`, while Xenova's playground yields `[1, 320, 29876]`.

Reproduction
Similar issue with `Xenova/llama-tokenizer`.
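The transformers.js code behind these numbers isn't included above. As a rough cross-check from Python against the same tokenizer files transformers.js downloads, the two strings can be encoded directly from the Xenova repos — a sketch, assuming both repos ship a `tokenizer.json` that `tokenizers` can load:

```python
from tokenizers import Tokenizer

# Encode both strings with the tokenizer files from the Xenova repos
# and compare the ids against the playground values quoted above.
for repo in ("Xenova/llama-tokenizer", "Xenova/llama-3-tokenizer"):
    tokenizer = Tokenizer.from_pretrained(repo)
    for text in ("\n", " \n"):
        ids = tokenizer.encode(text, add_special_tokens=False).ids
        print(repo, repr(text), ids)
```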