Closed: ciaochiaociao closed this issue 3 years ago.
Maybe of interest to @SaulLu
Thank you for your detailed issue.
I just tested the original XLM-R tokenizer and it seems to me that our tokenization matches well with the one in the repository you mention.
Indeed, by running the following (see the Google Colaboratory notebook):
import torch

# Load the original fairseq XLM-R model together with its tokenizer
xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.large')
xlmr.eval()

# Encode a sentence, then decode each id on its own to see the id <-> token mapping
tokens = xlmr.encode('I am good I am goodI am good.')
print([(xlmr.decode(torch.tensor([token])), token.item()) for token in tokens])
We get:
[('', 0),
('I', 87),
('am', 444),
('good', 4127),
('I', 87),
('am', 444),
('good', 4127),
('I', 568),
('am', 444),
('good', 4127),
('.', 5),
('', 2)]
The last line of the above snippet shows which token is associated with which id, and in particular that the "I" at the beginning of the sentence and the one in the middle share the same id, whereas the last "I", appended directly to another word, gets a different id. In the HF framework this means that id=87 corresponds to '▁I' and id=568 to 'I'. Therefore, I would tend to agree with the current output of tokenizer.convert_ids_to_tokens(tokenizer.encode('I am good.')).
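As a quick cross-check on the HF side, here is a minimal sketch; the ids 87 and 568 are taken from the fairseq output above and are assumed to line up with the HF vocabulary:

from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-large')

# The two ids observed above for "I": space-preceded vs. glued to the previous word
print(tokenizer.convert_ids_to_tokens([87, 568]))   # expected: ['▁I', 'I']
print(tokenizer.convert_ids_to_tokens(tokenizer.encode('I am good.')))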
Does this answer your question? Am I missing something? :slightly_smiling_face:
Thank you for your prompt reply and a clear example. I agree that tokenizer.convert_ids_to_tokens(tokenizer.encode('I am good.')) works as xlmr does. But what about tokenizer.decode(tokenizer.encode('I am good.'))? It currently gives <s> I am good.</s> (note the space between <s> and I). Shouldn't it be <s>I am good.</s>?

I know ▁ denotes a leading space in SentencePiece, which XLM-RoBERTa uses, as described in https://github.com/google/sentencepiece#what-is-sentencepiece. In addition, when a sub-word is at the beginning of the sentence it is also prefixed with ▁ even though it has no leading space, which makes sense because it distinguishes a sub-word piece that continues the previous word from one that marks a word boundary. But when decoding the encoded string, SentencePiece drops the leading space that the initial ▁ would otherwise produce, while HF's tokenizer does not (see the sketch below).
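To make the difference concrete, here is a minimal sketch comparing SentencePiece's own detokenization with HF's decode; it relies on the slow tokenizer's internal sp_model attribute, and the outputs in the comments are my assumptions based on this discussion:

from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-large')

pieces = tokenizer.tokenize('I am good.')          # presumably ['▁I', '▁am', '▁good', '.']
print(tokenizer.sp_model.decode_pieces(pieces))    # SentencePiece itself: 'I am good.' (leading space removed)

ids = tokenizer.encode('I am good.')
print(tokenizer.decode(ids))                       # HF decode: '<s> I am good.</s>' (space kept after <s>)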
The inserted <s> special token also adds a token before the whole sentence, so the original first sub-word is no longer the first one. Maybe special tokens should be ignored when determining which sub-word is the first. I know SentencePiece itself does not add the special token <s>; that is done in xlmr. Also, xlmr does not decode id=0 to <s> and id=2 to </s> but decodes both to the empty string, so xlmr.decode(xlmr.encode('I am good.')) actually outputs simply 'I am good.'
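If one wants HF's decode to behave similarly, a possible sketch is below; the skip_special_tokens flag does exist, but the exact output noted in the comment is my assumption and not verified in this thread:

from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-large')
ids = tokenizer.encode('I am good.')

# Drop <s> and </s> before detokenizing; presumably yields 'I am good.'
print(tokenizer.decode(ids, skip_special_tokens=True))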
I know this is just a very minor issue. I am just using the decode method and also trying to do some offset calculation. Maybe I should not rely on the decode method but on the offset_mapping one instead.
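For the offset calculation, a minimal sketch of the offset-mapping route, assuming the fast tokenizer (XLMRobertaTokenizerFast) is available; the sentence is just an illustration:

from transformers import XLMRobertaTokenizerFast

tokenizer = XLMRobertaTokenizerFast.from_pretrained('xlm-roberta-large')

text = 'I am good.'
encoding = tokenizer(text, return_offsets_mapping=True)

# Each (start, end) pair indexes into the original string; special tokens map to (0, 0)
for token, (start, end) in zip(tokenizer.convert_ids_to_tokens(encoding['input_ids']),
                               encoding['offset_mapping']):
    print(token, (start, end), repr(text[start:end]))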
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Environment info
transformers version: 3.5.1
Who can help
xlm-roberta tokenizer: @LysandreJik
Information
Model I am using (XLM-Roberta-Large and Roberta-Large):
The problem arises when using:
The tasks I am working on are:
To reproduce
Steps to reproduce the behavior:
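A minimal reproduction sketch based on the calls discussed above (the checkpoint name and exact snippet are my assumptions, not necessarily the original issue's code):

from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-large')

# Round-trip a simple sentence through encode and decode
print(tokenizer.decode(tokenizer.encode('I am good.')))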
This produces
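<s> I am good.</s>
(note the space between <s> and I, as described in the discussion above)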
Expected behavior
Decoding the encoded sentence should give back the same text, apart from the added <s> and </s> special tokens, similar to what is shown in XLM's GitHub README.