Thanks for your input! 😊 `convert_tokens_to_string` expects a sequence of tokens, which implies the text has already been tokenized to produce those tokens. The function is useful for making sense of tokens by returning a coherent string / phrase. As you suggested, `café` is not a token, because it is not in tokenized format. For example, you can use it like `tok.convert_tokens_to_string(tok.tokenize("café"))`, which passes in the token sequence (a runnable sketch follows below)! Hope this helps!
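A minimal sketch of that usage, assuming the slow `gpt2` tokenizer (the exact token strings depend on the vocabulary):

```python
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")  # slow (pure-Python) tokenizer, for illustration

tokens = tok.tokenize("café")                # byte-level token strings, e.g. something like ['caf', 'Ã©']
print(tokens)
print(tok.convert_tokens_to_string(tokens))  # -> 'café'  (real tokens round-trip correctly)
```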
Thank you @itazap for the support and your time!
I agree on how the function can/should be used.
I just felt that we could (at least) improve the wording in the docstrings of `tok.convert_tokens_to_string(...)` and `tok.decode(...)`, as mentioned above. I had tripped over these 2 functions and I am afraid others have/will, too.
However, if others don't feel there is a need to change anything, then I am fine if the issue is closed.
Thanks @trianxy, yes, it can be described more explicitly that `convert_tokens_to_string` should be used on a list of tokens, while `decode` should be used on a list of token ids. Feel free to open a PR for the docstrings if you'd like! 🤗
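For illustration, a minimal sketch of the distinction, assuming the `gpt2` checkpoint (variable names are only examples):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

ids = tok("café")["input_ids"]              # token ids (a list of ints)
tokens = tok.convert_ids_to_tokens(ids)     # the corresponding token strings (byte-level form for gpt2)

print(tok.decode(ids))                      # decode() operates on token ids       -> 'café'
print(tok.convert_tokens_to_string(tokens)) # convert_tokens_to_string() on tokens -> 'café'
```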
Sounds good. I'll look to add a PR soonish
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
`transformers` version: 4.44.2

Who can help?
@ArthurZucker @itazap
Information

Tasks

An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
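A minimal sketch of the behavior described here, assuming the slow `gpt2` tokenizer:

```python
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")   # slow (pure-Python) tokenizer

# Passing raw, untokenized text instead of tokens silently corrupts non-ASCII characters:
print(tok.convert_tokens_to_string(["café"]))               # -> 'caf�'

# The intended usage, going through tokenize(), round-trips correctly:
print(tok.convert_tokens_to_string(tok.tokenize("café")))   # -> 'café'
```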
Expected behavior
The expected behavior is mentioned above inside Reproduction. The same behavior appears for a few other models, which use similar code to `gpt2`.
Cause
The line uses `errors="replace"`, which will replace `é` with `�`, since `é` is not inside the dictionary `self.byte_encoder`.
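A plain-Python illustration of that replacement step (not the exact library code): if `é` reaches the byte-level UTF-8 decoding as a single byte rather than as its two-byte UTF-8 encoding, `errors="replace"` substitutes `�`:

```python
# 'c', 'a', 'f' followed by a lone 0xE9 (the code point of 'é' squeezed into one byte)
raw_bytes = bytearray([0x63, 0x61, 0x66, 0xE9])
print(raw_bytes.decode("utf-8", errors="replace"))  # -> 'caf�'  (0xE9 on its own is not valid UTF-8)
# With errors="strict", the same bytes would raise UnicodeDecodeError instead of silently replacing.
```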
Possible solutions
If others also feel that this is a problem, then there are a few ways to improve this behavior. I can create a PR if you like:
- Update the docstring of `tok.decode()`, similarly to how the library `tiktoken` includes a WARNING that this operation is lossy.
- Update the docstring of `tok.convert_tokens_to_string()` to have a warning that you should ONLY input tokens (!) (i.e. strings which are represented by a specific token_id); a possible wording is sketched after this list.
- Change `tok.convert_tokens_to_string()` to work for all strings, but I am not sure that's worth it, because this would need to be done for a lot of models, and it might break production code which may rely on the above (wrong?) behavior.
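For concreteness, a sketch of what such a docstring warning could look like for `convert_tokens_to_string` (illustrative wording only, not a final proposal; the method body is elided):

```python
def convert_tokens_to_string(self, tokens):
    """
    Converts a sequence of tokens (strings) into a single string.

    WARNING: `tokens` must be actual tokens, i.e. strings produced by `tokenize()` or
    `convert_ids_to_tokens()` that correspond to token ids. Passing arbitrary text may
    silently yield `�` replacement characters, since this operation can be lossy.
    """
    ...
```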