huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Different behaviour of BPE encoder after update to 0.14.1 #1358

Closed: DOGEwbx closed this 9 months ago

DOGEwbx commented 9 months ago

Hi, I just updated tokenizers from 0.13.2 to 0.14.1 and found that a tokenizer model which worked on the original version fails after the update.

from tokenizers import Tokenizer
test = Tokenizer.from_file("tokenizer.txt")
print(test.encode("hello world").tokens)

The output is ['h', 'e', 'l', 'l', 'o', 'Ġ', 'w', 'o', 'r', 'l', 'd'], even though the word hello can be found in the vocab of tokenizer.txt.

nceeliev commented 9 months ago

same problem

DOGEwbx commented 9 months ago

It turns out that #1335 changed the definition of added_token from "an additional part of the vocabulary" to "a word that needs to be split off but is not special". I used to add all the bytes to the vocab as added tokens, which now makes the tokenizer split everything into bytes.
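A rough sketch of the pattern (a toy vocab and a hand-edited serialization, not the actual tokenizer.txt from the report), assuming recent tokenizers Python bindings:

from tokenizers import Tokenizer, models
import json

# Toy BPE whose merges can reassemble "hello" from single characters.
vocab = {"h": 0, "e": 1, "l": 2, "o": 3, "he": 4, "ll": 5, "hell": 6, "hello": 7}
merges = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o")]
tok = Tokenizer(models.BPE(vocab=vocab, merges=merges))
print(tok.encode("hello").tokens)  # ['hello']

# Mimic a hand-edited tokenizer.json: list every character as an added token
# even though it is already in the vocab (what adding all bytes amounted to).
data = json.loads(tok.to_str())
data["added_tokens"] = [
    {"id": vocab[c], "content": c, "single_word": False, "lstrip": False,
     "rstrip": False, "normalized": False, "special": False}
    for c in "helo"
]
tok2 = Tokenizer.from_str(json.dumps(data))
# On 0.14.x the added-token matcher splits on these characters before the
# BPE model runs, so the merges never apply: ['h', 'e', 'l', 'l', 'o']
print(tok2.encode("hello").tokens)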

ArthurZucker commented 9 months ago

Indeed, thanks for answering!

ArthurConmy commented 9 months ago

@ArthurZucker this issue created an extremely subtle bug where users on different versions of tokenizers got different behavior in interpretability research. Are there plans to make the added_tokens behavior backwards compatible? This change breaks projects that require predictable tokenization, and the workaround of editing the underlying tokenizers is a pain.

ArthurZucker commented 8 months ago

Hey! I'll try to look into it; breaking our community's usage is not to our liking either! 🤗 I don't think it should be related to the tokens not being special, but yeah, that should have been BC. I'll dig into this, but it's super strange. Added tokens are now just tokens that may or may not be part of the vocab, but over which we want control: basically we can control stripping, normalization, single_word and special (whether or not we skip them when decoding).
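As a concrete sketch of those controls (parameter names as exposed by the Python bindings in recent releases; the exact defaults here are an assumption):

from tokenizers import Tokenizer, AddedToken, models

tok = Tokenizer(models.BPE())
tok.add_tokens([
    AddedToken(
        "<custom>",
        single_word=True,   # only match as a standalone word
        lstrip=False,       # don't strip whitespace on the left
        rstrip=False,       # don't strip whitespace on the right
        normalized=False,   # match against the raw, un-normalized input
    )
])
# Special tokens are registered separately and can be skipped when decoding
# via decode(..., skip_special_tokens=True).
tok.add_special_tokens(["<eos>"])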

ArthurConmy commented 8 months ago

Thanks @ArthurZucker! In this issue I show transformers code that behaves differently on two different tokenizers versions.

ArthurConmy commented 8 months ago

@ArthurZucker is there any update on this? When I last checked there were still inconsistencies.

[If you won't revert the change, I would appreciate a pair of eyes on this tokenizer PR, as it fixed the problem for me, but I'm not certain that it is consistent with the tokenizer behavior on old transformers/tokenizers versions. Thanks in advance!]

ArthurZucker commented 8 months ago

Sorry I was off last week and did not deep dive yet. I don't think we'll revert so I'll have a look at the PR!

ArthurConmy commented 8 months ago

@ArthurZucker I've now isolated the problem (but I don't know Rust, so I don't understand where things went wrong).

Consider this minimal tokenizer: https://huggingface.co/ArthurConmy/bugtest-tokenizer/blob/main/tokenizer.json

tokenizer.encode("hello") returned [] on old tokenizers versions, and now return [0]. But when I remove the "hello" from the vocab dictionary in the tokenizer.json file, it works fine and returns [0] on both tokenizers versions. So I think that when tokens were added to both the vocab and the added_tokens list in old tokenizers versions, the token got "deleted" from the vocabulary.

I agree that we shouldn't revert! However, I think you should add a warning so that users are warned when they load a tokenizer that has a token in both the vocabulary and the added tokens.
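Something along these lines could be the basis for that warning (a sketch only; it assumes the usual BPE-style tokenizer.json layout, with the vocab as a dict under model.vocab and added_tokens as a top-level list of objects with a content field):

import json

with open("tokenizer.json") as f:
    data = json.load(f)

vocab = set(data["model"]["vocab"])
added = {t["content"] for t in data.get("added_tokens", [])}
overlap = vocab & added
if overlap:
    print(f"Warning: {len(overlap)} token(s) appear in both the vocab and "
          f"added_tokens; encoding may differ across tokenizers versions, "
          f"e.g. {sorted(overlap)[:5]}")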

ArthurZucker commented 8 months ago

Sounds good. Note that before, unless you modified the tokenizer.json file, you could not add tokens that were already in the vocab. Did you manually add the token or did you use add_tokens?
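For comparison, going through the API looks roughly like this sketch; the assumption (matching what is described above) is that on pre-0.14 releases this call was a no-op for tokens already in the vocab, which is why the overlap only showed up in hand-edited files:

from tokenizers import Tokenizer, models

tok = Tokenizer(models.BPE(vocab={"hello": 0}, merges=[]))
# add_tokens returns how many tokens were actually added; on old releases a
# token already present in the vocab was skipped here rather than duplicated.
print(tok.add_tokens(["hello"]))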

ArthurConmy commented 8 months ago

@ArthurZucker yes haha, I think Neel Nanda modified the tokenizer.json by hand.

I think that probably @DOGEwbx @nceeliev did this too, judging by the tokenizer.txt file shared in the OP.

ArthurZucker commented 8 months ago

Awesome; in that case I'm not 100% sure adding a warning will help. I'll try to see if I can open a PR on the Hub, but it's a bit of a hard thing to find! Would a warning have helped you?

ArthurConmy commented 8 months ago

Yep, that's right! Catching this when loading the tokenizer.json would be great.

DOGEwbx commented 7 months ago

Hey guys, I also hit a bug when using byte-level BPE and add_tokens (#1392); maybe you should take a look at that too. Thanks a lot! @ArthurZucker