huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

llama3 tokenizer doesn't round trip #1543

Closed josharian closed 2 months ago

josharian commented 3 months ago
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer("hello !")
{'input_ids': [128000, 15339, 758], 'attention_mask': [1, 1, 1]}
>>> tokenizer.decode([128000, 15339, 758])
'<|begin_of_text|>hello!'

Observe that the input has a space before the `!` and the decoded output does not, so encode → decode does not round trip.

josharian commented 3 months ago

This does not reproduce using the upstream llama3 tokenizer.model and tiktoken.

ArthurZucker commented 3 months ago

I think the same issue was reported before; it's caused by the transformers layer's `clean_up_tokenization_spaces` flag. See this: https://github.com/huggingface/transformers/issues/31187
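For anyone hitting this: a minimal sketch (not the actual transformers source, just an illustration of the behavior) of what the cleanup step does to decoded text. It removes the space before common punctuation, which is exactly why `"hello !"` comes back as `"hello!"`:

```python
def clean_up_tokenization(text: str) -> str:
    # Sketch of the cleanup applied after decode when
    # clean_up_tokenization_spaces is enabled: drop the space
    # that precedes common punctuation and contractions.
    for before, after in [
        (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
        (" 's", "'s"), (" 't", "'t"), (" 'm", "'m"),
        (" 've", "'ve"), (" 're", "'re"),
    ]:
        text = text.replace(before, after)
    return text

print(clean_up_tokenization("hello !"))  # → hello!
```

As a workaround until the flag is removed, you can pass `clean_up_tokenization_spaces=False` to `decode` (or set it on the tokenizer) to get the raw, round-trippable text back.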

ArthurZucker commented 3 months ago

We are gonna deprecate and remove this flag 😉

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.