Closed DOGEwbx closed 9 months ago
same problem
It turns out that #1335 changed the definition of added_token from "an additional part of the vocabulary" to "a word that needs to be split out but is not special". I used to add all the bytes to the vocab as added tokens, which made the tokenizer split all bytes.
Indeed, thanks for answering!
@ArthurZucker this issue created an extremely subtle bug where users on different versions of tokenizers saw different behavior in interpretability research. Are there plans to make the added_tokens behavior backwards compatible? This change breaks projects that require predictable tokenization, and the workaround of editing the underlying tokenizers is a pain
Hey! I'll try to look into it; breaking our community usage is not to our liking either! 🤗 I don't think it should be related to the tokens not being special, but yes, that should have been BC. I'll dig into this, but it's super strange: added tokens are now just tokens that can or cannot be part of the vocab, but over which we want to have control. Basically, we can control stripping, normalization, single-word matching, and whether they are special (i.e. whether we skip them when decoding).
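For reference, a minimal sketch of the per-token flags being described, shaped like an entry in the `added_tokens` list of a `tokenizer.json` file (the `id` and `content` values here are made up for illustration):

```python
import json

# Sketch of one entry in the "added_tokens" list of a tokenizer.json.
# The boolean flags are the knobs mentioned above; id/content are invented.
added_token = {
    "id": 0,
    "content": "<my_token>",
    "single_word": False,   # only match as a standalone word?
    "lstrip": False,        # strip whitespace on the left when matching?
    "rstrip": False,        # strip whitespace on the right when matching?
    "normalized": False,    # run the normalizer on this token's content?
    "special": True,        # skip this token when decoding?
}

print(json.dumps(added_token, indent=2))
```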
Thanks @ArthurZucker! In this issue I show transformers code that behaves differently on two different tokenizers versions
@ArthurZucker is there any update on this? When I last checked there were still inconsistencies.
[If you won't revert the change, I would appreciate a pair of eyes on this tokenizer PR as this fixed the problem for me, but I'm not certain that it is consistent with the tokenizer on old transformers/tokenizers version. Thanks in advance!]
Sorry I was off last week and did not deep dive yet. I don't think we'll revert so I'll have a look at the PR!
@ArthurZucker I've now isolated the problem (though I don't know Rust, so I can't tell where things went wrong).
Consider this minimal tokenizer: https://huggingface.co/ArthurConmy/bugtest-tokenizer/blob/main/tokenizer.json
`tokenizer.encode("hello")` returned `[]` on old tokenizers versions, and now returns `[0]`. But when I remove `"hello"` from the vocab dictionary in the `tokenizer.json` file, it works fine and returns `[0]` on both tokenizers versions. So I think that when a token was added to both the vocab and the added_tokens list in old tokenizers versions, the token got "deleted" from the vocabulary.
I agree that we shouldn't revert! However, I think you should add a warning so that when users load a tokenizer that has a token in both the vocabulary and the added tokens list, they are warned
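Something like this stdlib-only sketch of the proposed check (the dictionary layout is an assumption about a typical `tokenizer.json` with a BPE model, and the toy data is invented to mirror the bugtest tokenizer above):

```python
import json
import warnings

def warn_on_overlap(tokenizer_json: dict) -> set:
    """Warn if any added token also appears in the model vocabulary."""
    vocab = tokenizer_json.get("model", {}).get("vocab", {})
    added = {t["content"] for t in tokenizer_json.get("added_tokens", [])}
    overlap = added & set(vocab)
    if overlap:
        warnings.warn(
            f"Tokens present in both the vocab and added_tokens: {sorted(overlap)}; "
            "old tokenizers versions effectively removed them from the vocab."
        )
    return overlap

# Toy example: "hello" appears in both the vocab and the added_tokens list.
toy = {
    "added_tokens": [{"id": 0, "content": "hello", "special": False}],
    "model": {"type": "BPE", "vocab": {"hello": 0}},
}
print(warn_on_overlap(toy))  # {'hello'}
```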
Sounds good. Note that before, unless you modified the `tokenizer.json` file, you could not add tokens that were already in the vocab. Did you manually add the token, or did you use `add_tokens`?
@ArthurZucker yes haha, I think Neel Nanda modified the `tokenizer.json` by hand.
I think that probably @DOGEwbx @nceeliev did this too, judging by the tokenizer.txt file shared in the OP.
Awesome, in that case I'm not 100% sure adding a warning will help. I'll try to see if I can open a PR on the Hub, but it's a bit of a hard thing to find! Would a warning have helped you?
Yep, that's right! When loading the `tokenizer.json`, catching this would be great
Hey guys, I also hit a bug when using byte-level BPE and add_tokens (#1392); maybe you should take a look at that too. Thanks a lot! @ArthurZucker
Hi, I just updated tokenizers from 0.13.2 to 0.14.1, and I found that tokenizer models which worked on the original version fail after the update.
Output:

```
['h', 'e', 'l', 'l', 'o', 'Ġ', 'w', 'o', 'r', 'l', 'd']
```

even though the word `hello` can be found in the vocab (`tokenizer.txt`).
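A stdlib-only sketch of the symptom being reported here (the vocab contents and the observed tokens are hard-coded stand-ins for the real `tokenizer.txt` and encoder output):

```python
# The word "hello" exists in the vocabulary, yet the encoder falls back to
# emitting single characters, consistent with the added-token overlap above.
vocab = {"hello", "world", "h", "e", "l", "o", "w", "r", "d", "\u0120"}

observed_tokens = ["h", "e", "l", "l", "o", "\u0120", "w", "o", "r", "l", "d"]

def split_despite_vocab(word: str, tokens: list, vocab: set) -> bool:
    """True if `word` is in the vocab but the output is only single characters."""
    return word in vocab and word not in tokens and all(len(t) == 1 for t in tokens)

print(split_despite_vocab("hello", observed_tokens, vocab))  # True
```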