Closed goerch closed 9 months ago
Hey! Could you share a full reproducer? Would help me a lot to have the output of transformers-cli env
as well. 🤗
Hey! Could you share a full reproducer? Would help me a lot to have the output of
transformers-cli env
as well. 🤗
https://github.com/jploski/llama.cpp/tree/hf-issue-1363/tests/hf-issue-1363
(env output included in README.md)
This comment is explaining the problem. Not sure if I should close this issue then?
Probably yes! Unless the changes need to be propagated to the slow tokenizer?
Thanks!
Hm, as a transformers user I would normally assume that the slow and fast tokenizers are both correct. If only the fast tokenizer is correct, this should be documented somewhere. (Maybe it is, and I'm just not aware.)
If only the fast tokenizer is correct, this should be documented somewhere.
I have been thinking about that too, but I accepted that the fast GPT2 tokenizer offers more features than the original one and Falcon used them (unfortunately for us). The remark about the documentation is correct (and I certainly would like to ask a lot more questions about the serialization formats of HF;).
Hm, as a transformers user I would normally assume that the slow and fast tokenizers are both correct. If only the fast tokenizer is correct, this should be documented somewhere. (Maybe it is, and I'm just not aware.)
I agree, these differences in functionality are confusing and would deserve at least a mention in the documentation (https://huggingface.co/docs/transformers/main_classes/tokenizer). I note that for the *TokenizerFast even the tokenizer_file parameter is currently entirely undocumented.
We are trying to get the same results for fast and slow, but in this specific case Falcon used the GPT2TokenizerFast with a custom template processor. We can add a FalconTokenizer
in transformers to be equivalent, but I believe the best we can do is:
Would any of you like to open a PR to update the documentation however you feel? You can ping me for review 🤗
Please see this comment by @jploski.