huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Unexpected result `tok.convert_tokens_to_string(["café"]) == 'caf�'` for some tokenizers like gpt2 #33577

Closed: trianxy closed this issue 1 week ago

trianxy commented 1 month ago

Who can help?

@ArthurZucker @itazap

Reproduction

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("gpt2", use_fast=False)

print(tok.convert_tokens_to_string(["café"]))  # error 1: prints 'caf�' instead of the expected 'café'

print(tok.encode(tok.decode([113]))[0])  # error 2: prints 4210 (the id of '�') instead of the expected 113 (related to error 1)

Expected behavior

The expected behavior is described in the comments inside Reproduction above. The same behavior appears for a few other models whose tokenizers use code similar to gpt2's.

Cause

The line

        text = bytearray([self.byte_decoder[c] for c in text]).decode("utf-8", errors=self.errors)

uses errors="replace". self.byte_decoder maps the é in "café" to the single raw byte 0xE9, which is not a valid UTF-8 sequence on its own, so the decode step replaces it with '�'. (In a genuine gpt2 token, é would instead appear as the two byte-level characters 'Ã©', standing for the UTF-8 byte pair 0xC3 0xA9.)
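To make the failure concrete, here is a small demonstration against the slow gpt2 tokenizer from the Reproduction (byte_decoder is an attribute of the slow GPT2Tokenizer):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2", use_fast=False)

# 'é' *is* a key of byte_decoder, but it maps to a single raw byte
print(tok.byte_decoder["é"])  # 233, i.e. 0xE9

# 0xE9 on its own is not valid UTF-8, so errors="replace" substitutes '�'
print(bytearray([233]).decode("utf-8", errors="replace"))  # '�'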

Possible solutions

If others also feel that this is a problem, then there are a few ways to improve this behavior. I can create a PR if you like:

  1. Improve the docstring of tok.decode(), similarly to how the tiktoken library includes a WARNING that the operation is lossy
  2. Improve the docstring of tok.convert_tokens_to_string() to warn that you should ONLY pass in tokens (!) (i.e. strings that correspond to a specific token_id)
  3. Finally, I could refactor tok.convert_tokens_to_string() to work for all strings (see the sketch after this list), but I am not sure that is worth it: the change would have to be made in a lot of models, and it might break production code that relies on the current behavior.
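For illustration, one shape option 3 could take; convert_tokens_to_string_forgiving is a hypothetical name, not library API, and the sketch assumes a slow byte-level tokenizer that exposes byte_decoder:

def convert_tokens_to_string_forgiving(tok, tokens):
    # Decode byte-level tokens strictly; fall back to the joined string
    # when the input is not a valid token sequence (e.g. plain "café").
    text = "".join(tokens)
    try:
        return bytearray([tok.byte_decoder[c] for c in text]).decode("utf-8")
    except (KeyError, UnicodeDecodeError):
        return text

print(convert_tokens_to_string_forgiving(tok, ["café"]))       # 'café'
print(convert_tokens_to_string_forgiving(tok, ["caf", "Ã©"]))  # 'café'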
itazap commented 1 month ago

Thanks for your input! 😊 convert_tokens_to_string expects a sequence of tokens, which implies that the text has already been tokenized to produce those tokens. The function is useful for making sense of tokens by returning a coherent string / phrase. As you noted, café is not a token, since it is not in tokenized form. You can instead do tok.convert_tokens_to_string(tok.tokenize("café")), which passes in a real token sequence (see the example below). Hope this helps!
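For concreteness, a sketch of that round trip, reusing the tok from the Reproduction:

tokens = tok.tokenize("café")
print(tokens)  # byte-level tokens; é is carried as the characters 'Ã©'
print(tok.convert_tokens_to_string(tokens))  # 'café', recovered losslessly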

trianxy commented 1 month ago

Thank you @itazap for the support and your time!

I agree on how the function can/should be used.

I just felt that we could (at least) improve the wording in the docstrings of tok.convert_tokens_to_string(...) and tok.decode(...), as mentioned above. I tripped over these two functions myself, and I am afraid others have, or will, too.

However, if others don't feel there is a need to change anything, then I am fine if the issue is closed.

itazap commented 1 month ago

Thanks @trianxy , yes it can be more explicitly documented that convert_tokens_to_string should be used on a list of tokens, while decode should be used on a list of token ids. Feel free to open a PR for the docstrings if you'd like! 🤗
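A sketch of the kind of warning such a docstring could carry (proposed wording only, not the current library text; the body shown is the existing gpt2 implementation):

def convert_tokens_to_string(self, tokens):
    """
    Converts a sequence of tokens (strings) into a single string.

    Warning: `tokens` must be actual tokens, i.e. strings produced by
    `tokenize()` or `convert_ids_to_tokens()`. Passing arbitrary text
    (such as "café") can silently yield U+FFFD ('�') replacement
    characters, because each character is mapped back to a raw byte
    before UTF-8 decoding.
    """
    text = "".join(tokens)
    text = bytearray([self.byte_decoder[c] for c in text]).decode("utf-8", errors=self.errors)
    return text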

trianxy commented 1 month ago

Sounds good. I'll look to add a PR soonish

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.