huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

add_tokens has no effect in llama fast tokenizer #1374

Closed tiandiweizun closed 8 months ago

tiandiweizun commented 8 months ago

```python
tokenizer = AutoTokenizer.from_pretrained("d:/models/Llama-2-7b-chat-hf")
print(tokenizer.tokenize("你是谁"))  # ['▁', '你', '是', '<0xE8>', '<0xB0>', '<0x81>']
tokenizer.add_tokens("谁")
print(tokenizer.tokenize("你是谁"))  # ['▁', '你', '是', '<0xE8>', '<0xB0>', '<0x81>'] — unchanged
```

When the tokenizer is loaded in slow mode, it works as expected:

```python
AutoTokenizer.from_pretrained("d:/models/Llama-2-7b-chat-hf", use_fast=False)
```

ArthurZucker commented 8 months ago

Hey, adding the token with `normalized=False` should do the trick 😉

```python
tokenizer.add_tokens(AddedToken("谁", normalized=False))
```

You can then check the content of the added tokens with `tokenizer.added_tokens_decoder`.
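To see why `normalized=False` matters, here is a minimal toy sketch of the matching behavior — not the real `tokenizers` internals. The `normalize` and `split_on_added_token` helpers are hypothetical stand-ins: Llama's SentencePiece-style normalizer prepends `▁`, so a `normalized=True` added token `"谁"` is stored as the pattern `"▁谁"`, which never matches the raw `"谁"` in the middle of the input.

```python
def normalize(text: str) -> str:
    """Toy stand-in for a Llama-style normalizer: prepend '▁', map spaces to '▁'."""
    return "▁" + text.replace(" ", "▁")

def split_on_added_token(text: str, token: str, normalized: bool) -> list[str]:
    """Toy added-token extraction: split normalized input on the token pattern."""
    if normalized:
        # The added token's content is itself normalized before matching,
        # turning "谁" into "▁谁".
        token = normalize(token)
    haystack = normalize(text)
    if token in haystack:
        left, _, right = haystack.partition(token)
        return [part for part in (left, token, right) if part]
    return [haystack]

# Pattern "▁谁" does not occur inside "▁你是谁" -> no split, token has no effect
print(split_on_added_token("你是谁", "谁", normalized=True))   # ['▁你是谁']
# Raw pattern "谁" matches -> the added token is extracted as its own piece
print(split_on_added_token("你是谁", "谁", normalized=False))  # ['▁你是', '谁']
```

The slow (Python) tokenizer happens to match the raw token string, which is why the same `add_tokens("谁")` call appears to work there without the flag.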