dinhanhx opened 11 months ago
Sorry, I did not have time to check this. The tokens should be added directly to the vocab, not as special tokens. The trainer should handle this, but it's not supported yet.
It is planned
Both trainers don't support that yet. I'll work on this at some point!
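In the meantime, a possible stopgap (a sketch under assumptions, not an official API: it assumes a tokenizers version with BPE byte_fallback support, roughly >= 0.14, and uses a placeholder corpus data.txt) is to train first and then splice the 256 <0xNN> pieces directly into the serialized model vocab, so they end up as regular vocab entries rather than added special tokens:

```python
import json

from tokenizers import Tokenizer, decoders
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Train a BPE tokenizer with byte_fallback enabled ("data.txt" is a placeholder corpus).
tokenizer = Tokenizer(BPE(unk_token="<unk>", byte_fallback=True))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.decoder = decoders.ByteFallback()  # so <0xNN> pieces decode back to bytes
trainer = BpeTrainer(vocab_size=32000, special_tokens=["<unk>", "<s>", "</s>"])
tokenizer.train(["data.txt"], trainer)

# Splice <0x00>..<0xFF> into the model's own vocab via the serialized JSON,
# so they are plain vocab entries rather than added/special tokens.
state = json.loads(tokenizer.to_str())
vocab = state["model"]["vocab"]
next_id = max(vocab.values()) + 1
for i in range(256):
    piece = f"<0x{i:02X}>"
    if piece not in vocab:
        vocab[piece] = next_id
        next_id += 1
tokenizer = Tokenizer.from_str(json.dumps(state))
```

This JSON round-trip is only meant to bridge the gap until the trainers support it natively.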
Did not have time again, but yeah, the plan is to add this to the trainers!
Alternative title
How to make a tokenizer behave like Llama's
Background
The Llama tokenizer treats byte_fallback tokens (<0x00> through <0xFF>) as regular vocabulary tokens, not special tokens. When it decodes, it only removes the actual special tokens (unk, pad, bos, eos) and keeps the byte_fallback tokens.
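For illustration, this is roughly what that behavior looks like with the ungated test copy of the Llama tokenizer on the Hub (hf-internal-testing/llama-tokenizer is an assumption here; any Llama checkpoint should behave the same), with indicative outputs in the comments:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

ids = tok.encode("🦙")  # not in the 32k vocab, so it falls back to byte pieces
print(tok.convert_ids_to_tokens(ids))
# e.g. ['<s>', '▁', '<0xF0>', '<0x9F>', '<0xA6>', '<0x99>']
print(tok.decode(ids, skip_special_tokens=True))
# '🦙' -- only the real special tokens are stripped; the <0x..> pieces survive
# and are decoded back into the original character
```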
What I am trying to do
I'm trying to create a tokenizer that behaves like Llama's. However, I am only able to add the byte_fallback tokens as special tokens.
Problem
No matter how I tried this line
tokenizer.add_tokens([AddedToken(content=f"<0x{i:02X}>", special=True, normalized=False) for i in range(256)])
at different positions in my code (before training, after training) and with different AddedToken parameters, I still cannot achieve Llama's behavior.
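A minimal sketch of what seems to go wrong (using a toy BPE vocab as a stand-in, not the actual trained tokenizer): tokens added via add_tokens land in the added-tokens table rather than in the BPE model's own vocab, so the model-level byte fallback apparently cannot produce them, and marking them special additionally makes decode(..., skip_special_tokens=True) strip them.

```python
from tokenizers import AddedToken, Tokenizer
from tokenizers.models import BPE

# Toy stand-in for a trained tokenizer: byte_fallback is on, but there are no
# <0xNN> entries in the model vocab (hypothetical example, not the issue's tokenizer).
tokenizer = Tokenizer(
    BPE(vocab={"<unk>": 0, "h": 1, "i": 2}, merges=[], unk_token="<unk>", byte_fallback=True)
)

tokenizer.add_tokens(
    [AddedToken(content=f"<0x{i:02X}>", special=False, normalized=False) for i in range(256)]
)

print("<0x41>" in tokenizer.get_vocab(with_added_tokens=False))  # False: not in the BPE vocab
print("<0x41>" in tokenizer.get_vocab(with_added_tokens=True))   # True: only an added token
print(tokenizer.encode("A").tokens)  # expect ['<unk>']: byte fallback can't reach the added pieces
```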