huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Unexpected result `tok.convert_tokens_to_string(["café"]) == 'caf�'` for some tokenizers like gpt2 #33577

Closed: trianxy closed this issue 1 week ago

trianxy commented 1 month ago

Who can help?

@ArthurZucker @itazap

Reproduction

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("gpt2", use_fast=False)

print(tok.convert_tokens_to_string(["café"]))  # error 1: prints 'caf�' instead of the expected 'café'

print(tok.encode(tok.decode([113]))[0])  # error 2: prints 4210 (the id of '�') instead of the expected 113 (related to error 1)

Expected behavior

The expected behavior is described in the comments inside Reproduction above. The same behavior appears for a few other models whose tokenizers use code similar to gpt2's.

Cause

The line

        text = bytearray([self.byte_decoder[c] for c in text]).decode("utf-8", errors=self.errors)

uses errors="replace". self.byte_decoder maps the é in "café" to the single raw byte 0xE9, which is not a valid UTF-8 sequence on its own, so the decode step replaces it with '�'. (In a genuine gpt2 token, é would instead appear as the two byte-level characters 'Ã©', standing for the UTF-8 byte pair 0xC3 0xA9.)
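To make the failure concrete, here is a small demonstration against the slow gpt2 tokenizer from the Reproduction (byte_decoder is an attribute of the slow GPT2Tokenizer):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2", use_fast=False)

# 'é' *is* a key of byte_decoder, but it maps to a single raw byte
print(tok.byte_decoder["é"])  # 233, i.e. 0xE9

# 0xE9 on its own is not valid UTF-8, so errors="replace" substitutes '�'
print(bytearray([233]).decode("utf-8", errors="replace"))  # '�'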

Possible solutions

If others also feel that this is a problem, then there are a few ways to improve this behavior. I can create a PR if you like:

  1. Improve the docstring of tok.decode(), similarly to how the tiktoken library includes a WARNING that the operation is lossy
  2. Improve the docstring of tok.convert_tokens_to_string() to warn that you should ONLY pass in tokens (!) (i.e. strings that correspond to a specific token_id)
  3. Finally, I could refactor tok.convert_tokens_to_string() to work for all strings (see the sketch after this list), but I am not sure that is worth it: the change would have to be made in a lot of models, and it might break production code that relies on the current behavior.
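For illustration, one shape option 3 could take; convert_tokens_to_string_forgiving is a hypothetical name, not library API, and the sketch assumes a slow byte-level tokenizer that exposes byte_decoder:

def convert_tokens_to_string_forgiving(tok, tokens):
    # Decode byte-level tokens strictly; fall back to the joined string
    # when the input is not a valid token sequence (e.g. plain "café").
    text = "".join(tokens)
    try:
        return bytearray([tok.byte_decoder[c] for c in text]).decode("utf-8")
    except (KeyError, UnicodeDecodeError):
        return text

print(convert_tokens_to_string_forgiving(tok, ["café"]))       # 'café'
print(convert_tokens_to_string_forgiving(tok, ["caf", "Ã©"]))  # 'café'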
itazap commented 1 month ago

Thanks for your input! 😊 convert_tokens_to_string expects a sequence of tokens, which implies that the text has already been tokenized to produce those tokens. The function is useful for making sense of tokens by returning a coherent string / phrase. As you noted, café is not a token, since it is not in tokenized form. You can instead do tok.convert_tokens_to_string(tok.tokenize("café")), which passes in a real token sequence (see the example below). Hope this helps!
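For concreteness, a sketch of that round trip, reusing the tok from the Reproduction:

tokens = tok.tokenize("café")
print(tokens)  # byte-level tokens; é is carried as the characters 'Ã©'
print(tok.convert_tokens_to_string(tokens))  # 'café', recovered losslessly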

trianxy commented 1 month ago

Thank you @itazap for the support and your time!

I agree on how the function can/should be used.

I just felt that we could (at least) improve the wording in the docstrings of tok.convert_tokens_to_string(...) and tok.decode(...), as mentioned above. I tripped over these two functions myself, and I am afraid others have, or will, too.

However, if others don't feel there is a need to change anything, then I am fine if the issue is closed.

itazap commented 1 month ago

Thanks @trianxy , yes it can be more explicitly documented that convert_tokens_to_string should be used on a list of tokens, while decode should be used on a list of token ids. Feel free to open a PR for the docstrings if you'd like! 🤗
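A sketch of the kind of warning such a docstring could carry (proposed wording only, not the current library text; the body shown is the existing gpt2 implementation):

def convert_tokens_to_string(self, tokens):
    """
    Converts a sequence of tokens (strings) into a single string.

    Warning: `tokens` must be actual tokens, i.e. strings produced by
    `tokenize()` or `convert_ids_to_tokens()`. Passing arbitrary text
    (such as "café") can silently yield U+FFFD ('�') replacement
    characters, because each character is mapped back to a raw byte
    before UTF-8 decoding.
    """
    text = "".join(tokens)
    text = bytearray([self.byte_decoder[c] for c in text]).decode("utf-8", errors=self.errors)
    return text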

trianxy commented 1 month ago

Sounds good. I'll look to add a PR soonish

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.