Closed abgoswam closed 3 years ago
Hi, thanks for opening an issue! Seen with @n1to, and this comes from the Unigram-based tokenizers: Unigram's tokenize cuts the string into tokens, and then converts them to IDs. Unknown tokens are detected during the token to IDs conversion, rather than when the string is cut into tokens.
This is different from BPE, where the string is cut into many independent characters, converted to IDs, and then merged together.
This is also different from WordPiece, where we start from the word and cut it down until we find a token representation for each word piece; if we can't, the word is unknown.
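As a toy illustration of the Unigram flow described above (hypothetical vocab and pieces, not the real XLM-R vocabulary or implementation), the key point is that the unknown ID only appears in the second step:

```python
# Toy sketch: Unigram-style two-step flow (illustrative only).
vocab = {"请": 5, "在": 6, "餐厅": 7, "<unk>": 3}

def tokenize(text, pieces=("请", "在", "黄", "餐厅")):
    # Step 1: cut the string into tokens -- no <unk> appears here.
    return [p for p in pieces if p in text]

def convert_tokens_to_ids(tokens):
    # Step 2: unknown tokens are only detected at the ID lookup.
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

tokens = tokenize("请在黄餐厅")
ids = convert_tokens_to_ids(tokens)
print(tokens)  # ['请', '在', '黄', '餐厅'] -- '黄' survives tokenization
print(ids)     # [5, 6, 3, 7] -- but it maps to the unknown ID 3
```

So a token can come out of `tokenize` looking perfectly valid and still end up as the unknown ID once it is looked up.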
Hi @LysandreJik, thanks for looking into this and sharing the info.

Based on your response, it seems that for the `XLMRobertaTokenizer` we cannot guarantee that the following holds:

```python
assert tokenizer.decode(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))) == text
```

Am I right?
I believe that's true for any tokenizer: if the tokenizer cannot represent part of your text because it is not in the vocabulary, then some information is lost.
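A toy sketch of why the round trip cannot be guaranteed (hypothetical whitespace "tokenizer" and vocab, not a real HF tokenizer):

```python
# Toy vocab and whitespace tokenizer -- purely illustrative.
vocab = {"hello": 1, "world": 2, "<unk>": 0}
inv_vocab = {i: t for t, i in vocab.items()}

def encode(text):
    # Any word outside the vocab collapses to the unknown ID.
    return [vocab.get(w, vocab["<unk>"]) for w in text.split()]

def decode(ids):
    return " ".join(inv_vocab[i] for i in ids)

text = "hello brave world"
print(decode(encode(text)))  # hello <unk> world -- 'brave' is lost
```

Once a token collapses to the unknown ID, `decode` has no way to recover the original surface form, so `decode(encode(text)) == text` fails.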
Hey guys,
for the sake of completeness, here's the double check with the reference implementation/tokenizer:
```python
import torch

xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.base')
xlmr.eval()

tokens = xlmr.encode('请在黄鹂餐厅预订今晚7点半的位置。')
```
It outputs:

```
tensor([     0,      6,   9736,    213,  19390,      3, 113638, 209093, 155755,
           966,   2391,   6193,  57486,     30,      2])
```
`3` is the ID for the unknown token, but you can "reverse" the tokenization with:

```python
xlmr.decode(tokens)
```

This outputs:

```
'请在黄<unk>餐厅预订今晚7点半的位置。'
```

So the `<unk>` token also appears :)
@LysandreJik, agreed that for any tokenizer some information loss can happen if a token is not part of the vocab.

I guess the SentencePiece tokenizer is unique in one sense: the `SentencePieceProcessor` provides a lossless data conversion that allows the original raw sentence to be perfectly reconstructed from the encoded data, i.e., `Decode(Encode(input)) == input`.

Because of this, in the `tokenize` API of `XLMRobertaTokenizer` there is no `<unk>` when the string is cut into tokens. But in the `encode` API, when the tokens are converted to IDs, `<unk>` tokens are possible, as @stefan-it confirmed.
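A practical consequence of the above: since `tokenize` alone never surfaces `<unk>`, the loss only becomes visible after the ID conversion. A minimal helper to spot affected tokens up front (toy vocab; `find_unknown_tokens` is a hypothetical name, not a transformers API):

```python
def find_unknown_tokens(tokens, vocab, unk_token="<unk>"):
    """Return the tokens that the vocab can only represent as <unk>."""
    unk_id = vocab[unk_token]
    return [t for t in tokens if vocab.get(t, unk_id) == unk_id and t != unk_token]

# Toy vocab for illustration.
vocab = {"请": 5, "在": 6, "黄": 8, "<unk>": 3}
print(find_unknown_tokens(["请", "在", "鹂"], vocab))  # ['鹂']
```

With a real tokenizer, the same check can be done by comparing the output of `convert_tokens_to_ids` against the tokenizer's unknown-token ID before decoding.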
Thanks, folks, for the discussion and the insight into the behaviour of tokenizers in HF.
Closing this issue, since it's not a bug per se.
Hi guys,
I tried to reproduce the code at the beginning of the topic and I get the following:

Environment info

- `transformers` version: 4.5.0.dev0 (latest master)

Who can help

@LysandreJik

Information

`XLMRobertaTokenizer` `encode_plus` API producing `<unk>` for a valid token

To reproduce

Expected behavior