huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

RobertaTokenizer : tokenizer.decode and tokenizer.tokenize do not generate the same output #1376

Closed BettyFabre closed 7 months ago

BettyFabre commented 8 months ago

Hello, I use a RobertaTokenizer to tokenize sentences that contain French characters like é or ç. I need the generated tokens with the Ġ character and with the French characters well formatted.

For instance with the input input = "3 allées paris, 75000"

[tokenizer.decode([token]) for token in tokenizer.encode(input)]
outputs ['<s>', ' 3', ' all', 'ées', ' paris', ',', ' 7', '5000', '</s>'], so the Ġ markers are replaced by spaces.

And tokenizer.tokenize(input)
outputs ['Ġ3', 'Ġall', 'Ã©es', 'Ġparis', ',', 'Ġ7', '5000'], so the French characters are not well formatted.
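A minimal self-contained snippet along these lines (a sketch, assuming the public roberta-base checkpoint; the exact splits depend on the vocabulary actually used) reproduces both behaviours:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")  # assumption: roberta-base
text = "3 allées paris, 75000"

# Decoding each id individually renders readable text, but the Ġ marker
# (which encodes the leading space at the byte level) comes back as a plain space.
print([tokenizer.decode([i]) for i in tokenizer.encode(text)])

# tokenize() returns the raw byte-level tokens: Ġ is kept, but accented
# characters stay in their byte-level form (e.g. 'Ã©' for 'é').
print(tokenizer.tokenize(text))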

I used to do this, and it used to work:

inputs = self.tokenizer.encode_plus(input, return_tensors="pt")
ids = inputs["input_ids"].cpu().tolist()  # shape (1, seq_len) -> nested list of ids
clean_tokens = [self.tokenizer.decode([token]) for token in ids[0]]

But for reasons I cannot understand, it no longer outputs the tokens with the Ġ characters, and I cannot figure out what the breaking change was. I cannot reproduce the old behaviour with older/different versions of the torch, tokenizers or transformers libraries.

Do you have any idea?

ArthurZucker commented 8 months ago

Hey! If you want the tokens and not the string, you should use [tokenizer.convert_ids_to_tokens(token) for token in list_token]

BettyFabre commented 8 months ago
input = "3 allées paris"
tokens = tokenizer(input)

[tokenizer.convert_ids_to_tokens(token) for token in tokens['input_ids']]
# ['<s>', 'Ġ3', 'Ġall', 'Ã©es', 'Ġparis', '</s>']

It returns the same as tokenizer.tokenize(input); the tokens are not well formatted.
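One possible workaround (a sketch only, not an official API for this; it assumes the public roberta-base checkpoint): decode each token back to text with tokenizer.convert_tokens_to_string, then re-attach the Ġ marker when the original token carried one. Tokens that split a multi-byte character across token boundaries may not round-trip individually.

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")  # assumption: roberta-base

text = "3 allées paris"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))

readable = []
for tok in tokens:
    if tok in tokenizer.all_special_tokens:
        readable.append(tok)  # keep <s>, </s>, ... unchanged
        continue
    # convert_tokens_to_string maps the byte-level characters back to UTF-8 text;
    # the leading Ġ becomes a space, so re-attach the marker explicitly.
    decoded = tokenizer.convert_tokens_to_string([tok])
    if tok.startswith("Ġ"):
        decoded = "Ġ" + decoded.lstrip(" ")
    readable.append(decoded)

print(readable)
# e.g. ['<s>', 'Ġ3', 'Ġall', 'ées', 'Ġparis', '</s>'] with the splits shown above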

ArthurZucker commented 8 months ago

Could you share your tokenizer.json file or a repo on the hub + maybe a reproducible snippet? Most breaking changes happened with:

But the roberta tokenizer has not been touched per se.

github-actions[bot] commented 7 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.