Closed BettyFabre closed 7 months ago
Hey! If you want the tokens and not the string, you should use [tokenizer.convert_ids_to_tokens(token) for token in list_token]
input = "3 allées paris"
tokens = tokenizer(input)
[tokenizer.convert_ids_to_tokens(token) for token in tokens['input_ids']]
# ['<s>', 'Ġ3', 'Ġall', 'ées', 'Ġparis', '</s>']
It returns the same as tokenizer.tokenize(input); the tokens are not well formatted.
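For anyone puzzled by why the tokens look this way: byte-level BPE tokenizers (the kind RoBERTa uses) first map every raw byte to a printable unicode character, which is why a leading space surfaces as 'Ġ' (U+0120) and the two UTF-8 bytes of 'é' can surface as 'Ã©'. A minimal sketch of that mapping, reimplementing the well-known GPT-2-style bytes_to_unicode helper for illustration (not the actual transformers internals):

```python
def bytes_to_unicode():
    """Map every byte 0-255 to a printable unicode character.

    Printable bytes map to themselves; the remaining bytes (controls,
    space, etc.) are shifted up by 256 so none of them collapse into
    whitespace or control characters.
    """
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_map = bytes_to_unicode()
# Space (0x20) is a non-printable byte, so it is shifted to U+0120 = 'Ġ';
# 'é' is two UTF-8 bytes (0xC3 0xA9), which map to 'Ã' and '©'.
print("".join(byte_map[b] for b in " allées".encode("utf-8")))
# 'ĠallÃ©es'
```

This is why the raw tokens carry Ġ markers and why accented characters can look mangled at the token level even though decoding the ids recovers the correct text.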
Could you share your tokenizer.json file or a repo on the Hub, plus maybe a reproducible snippet?
Most breaking changes happened with:
But the RoBERTa tokenizers have not been touched per se
Hello, I use a RobertaTokenizer to tokenize sentences that contain French characters like é or ç. I need the generated tokens with the Ġ character and the French characters correctly formatted.
For instance with the input
input = "3 allées paris, 75000"
[tokenizer.decode([token]) for token in tokenizer.encode(input)]
outputs
['<s>', ' 3', ' all', 'ées', ' paris', ',', ' 7', '5000', '</s>']
so the Ġ are replaced by spaces. And
tokenizer.tokenize(input)
outputs
['Ġ3', 'Ġall', 'ées', 'Ġparis', ',', 'Ġ7', '5000']
so the French characters are not well formatted. I used to do this, and it used to work:
But for some reason I cannot understand, it no longer outputs the tokens with the Ġ characters, and I cannot figure out what the breaking change was. I cannot reproduce it with older/different versions of the torch, tokenizers, or transformers libraries.
Do you have any idea?
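As a stopgap, the desired output (Ġ markers kept, accents readable) can be reconstructed from the raw byte-level tokens by inverting the byte-to-unicode map and re-marking word-initial spaces. This is only a sketch of one possible workaround, assuming the standard GPT-2-style bytes_to_unicode mapping; it is not necessarily what the original code did:

```python
def bytes_to_unicode():
    """GPT-2-style map from every byte 0-255 to a printable unicode char."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

def readable(token, inv):
    """Turn a raw byte-level token into readable text, keeping Ġ markers."""
    raw = bytes(inv[ch] for ch in token).decode("utf-8")
    # Re-mark a word-initial space with Ġ instead of leaving a bare space.
    return "Ġ" + raw[1:] if raw.startswith(" ") else raw

inv = {v: k for k, v in bytes_to_unicode().items()}
tokens = ["Ġ3", "Ġall", "Ã©es", "Ġparis"]  # sample raw byte-level tokens
print([readable(t, inv) for t in tokens])
# ['Ġ3', 'Ġall', 'ées', 'Ġparis']
```

The idea is simply: each character of a raw token names one byte; mapping the characters back to bytes and decoding as UTF-8 restores the accents, and the leading byte 0x20 is then re-rendered as Ġ.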