huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Difference between slow and fast GPT2 tokenizers #1363

goerch closed this issue 9 months ago

goerch commented 9 months ago

Please see this comment by @jploski.

ArthurZucker commented 9 months ago

Hey! Could you share a full reproducer? Would help me a lot to have the output of transformers-cli env as well. 🤗

jploski commented 9 months ago

> Hey! Could you share a full reproducer? Would help me a lot to have the output of transformers-cli env as well. 🤗

https://github.com/jploski/llama.cpp/tree/hf-issue-1363/tests/hf-issue-1363

(env output included in README.md)

goerch commented 9 months ago

This comment explains the problem. Not sure whether I should close this issue then?

ArthurZucker commented 9 months ago

Probably yes! Unless the changes need to be propagated to the slow tokenizer?

goerch commented 9 months ago

Thanks!

cebtenzzre commented 9 months ago

Hm, as a transformers user I would normally assume that the slow and fast tokenizers are both correct. If only the fast tokenizer is correct, this should be documented somewhere. (Maybe it is, and I'm just not aware.)
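One way to check that assumption is to run both tokenizers side by side over a set of probe strings. A minimal, stdlib-only sketch of such a harness (the stand-in lambdas below are made up; in practice you would pass the encode methods of the slow and fast transformers tokenizers):

```python
def compare_tokenizers(tok_a, tok_b, texts):
    """Collect the inputs on which two tokenizer callables disagree."""
    mismatches = []
    for text in texts:
        a, b = tok_a(text), tok_b(text)
        if a != b:
            mismatches.append((text, a, b))
    return mismatches

# Hypothetical stand-ins that differ on leading whitespace, mimicking the kind
# of divergence discussed in this issue; substitute e.g.
# AutoTokenizer.from_pretrained(name, use_fast=False).encode and the fast one.
slow = lambda t: t.split(" ")
fast = lambda t: t.strip().split(" ")

print(compare_tokenizers(slow, fast, ["a b", " a b"]))
# prints [(' a b', ['', 'a', 'b'], ['a', 'b'])]
```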

goerch commented 9 months ago

> If only the fast tokenizer is correct, this should be documented somewhere.

I have been thinking about that too, but I came to accept that the fast GPT2 tokenizer offers more features than the original one, and Falcon used them (unfortunately for us). The remark about the documentation is correct (and I certainly have a lot more questions about HF's serialization formats ;)).

jploski commented 9 months ago

> Hm, as a transformers user I would normally assume that the slow and fast tokenizers are both correct. If only the fast tokenizer is correct, this should be documented somewhere. (Maybe it is, and I'm just not aware.)

I agree, these differences in functionality are confusing and deserve at least a mention in the documentation (https://huggingface.co/docs/transformers/main_classes/tokenizer). I note that for the *TokenizerFast classes even the tokenizer_file parameter is currently entirely undocumented.
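For anyone looking: tokenizer_file points at a serialized tokenizer.json, which can be inspected with plain json. The trimmed sample below only sketches the top-level layout (the field contents are invented, but the section names match the tokenizers serialization format as far as I can tell):

```python
import json

# Heavily trimmed stand-in for a real tokenizer.json; an actual file carries
# the full vocab/merges under "model" and richer settings in every section.
sample = json.dumps({
    "version": "1.0",
    "normalizer": None,
    "pre_tokenizer": {"type": "ByteLevel"},
    "post_processor": {"type": "TemplateProcessing"},
    "decoder": {"type": "ByteLevel"},
    "model": {"type": "BPE", "vocab": {"a": 0}, "merges": []},
})

data = json.loads(sample)
# The post_processor section is where a custom template (as in Falcon) lives.
print(data["post_processor"]["type"])
# prints TemplateProcessing
```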

ArthurZucker commented 9 months ago

We are trying to get the same results for fast and slow, but in this specific case Falcon used the GPT2TokenizerFast with a custom template processor. We could add a FalconTokenizer in transformers to make them equivalent, but I believe the best we can do for now is update the documentation.

Would any of you like to open a PR to update the documentation however you see fit? You can ping me for review 🤗
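For background, the "custom template processor" mentioned above is a post-processing step that splices special tokens around the encoded sequence (in the tokenizers library this is the TemplateProcessing post-processor). A toy stdlib-only sketch of the idea; the template and ids below are invented for illustration:

```python
def apply_template(template, seq_ids, special_ids):
    """Expand a template like "<BOS> $A <EOS>" into a flat id sequence.

    $A stands for the encoded input; every other piece is looked up in
    special_ids. This mimics (very loosely) what a template post-processor
    does after the core tokenizer has run.
    """
    out = []
    for piece in template.split():
        if piece == "$A":
            out.extend(seq_ids)
        else:
            out.append(special_ids[piece])
    return out

# Invented ids for illustration only.
print(apply_template("<BOS> $A <EOS>", [10, 11, 12], {"<BOS>": 1, "<EOS>": 2}))
# prints [1, 10, 11, 12, 2]
```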