Closed tiandiweizun closed 8 months ago
With the fast tokenizer, adding the token "谁" has no effect on tokenization:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("d:/models/Llama-2-7b-chat-hf")
print(tokenizer.tokenize("你是谁"))  # ['▁', '你', '是', '<0xE8>', '<0xB0>', '<0x81>']
tokenizer.add_tokens("谁")
print(tokenizer.tokenize("你是谁"))  # still ['▁', '你', '是', '<0xE8>', '<0xB0>', '<0x81>']; the added token is ignored
```
When the tokenizer is loaded in slow mode, it works correctly:

```python
tokenizer = AutoTokenizer.from_pretrained("d:/models/Llama-2-7b-chat-hf", use_fast=False)
```
Hey, adding the token with `normalized=False` should do the trick 😉:

```python
from tokenizers import AddedToken

tokenizer.add_tokens(AddedToken("谁", normalized=False))
```

You can then check the content of the added tokens using `tokenizer.added_tokens_decoder`.
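To see what the `normalized` flag actually changes, here is a minimal standalone sketch using a toy `WordLevel` tokenizer from the `tokenizers` library. The tiny vocab and the lowercasing normalizer are illustrative assumptions, not part of the thread; they just make the effect visible without downloading a model:

```python
from tokenizers import Tokenizer, AddedToken
from tokenizers.models import WordLevel
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace

# Toy tokenizer (illustrative only): a two-word vocab plus a lowercasing normalizer.
tok = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1}, unk_token="[UNK]"))
tok.normalizer = Lowercase()
tok.pre_tokenizer = Whitespace()

# With normalized=False the added token is matched against the *raw* input,
# before the normalizer runs, so the uppercase form survives intact.
tok.add_tokens([AddedToken("WHO", normalized=False)])
print(tok.encode("Hello WHO").tokens)
```

This is consistent with the Llama case above: with the default `normalized=True`, the fast tokenizer normalizes the text before trying to match "谁", so the added token never matches.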