huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

A Space Always Prefixes The First Token of `xlm-roberta-large` Encoding Results #13462

Closed ciaochiaociao closed 3 years ago

ciaochiaociao commented 3 years ago

Environment info

Who can help

xlm-roberta tokenizer @LysandreJik

Information

Model I am using (XLM-Roberta-Large and Roberta-Large):

The problem arises when using:

The task I am working on is:

To reproduce

Steps to reproduce the behavior:

from transformers import AutoTokenizer

# roberta-large (default tokenizer)
tokenizer = AutoTokenizer.from_pretrained('roberta-large')
print(tokenizer.convert_ids_to_tokens(tokenizer.encode('I am good.')))
print(tokenizer.decode(tokenizer.encode('I am good.')))

# roberta-large with use_fast=True
tokenizer = AutoTokenizer.from_pretrained('roberta-large', use_fast=True)
print(tokenizer.convert_ids_to_tokens(tokenizer.encode('I am good.')))
print(tokenizer.decode(tokenizer.encode('I am good.')))

# xlm-roberta-large (default tokenizer)
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-large')
print(tokenizer.convert_ids_to_tokens(tokenizer.encode('I am good.')))
print(tokenizer.decode(tokenizer.encode('I am good.')))

# xlm-roberta-large with use_fast=True
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-large', use_fast=True)
print(tokenizer.convert_ids_to_tokens(tokenizer.encode('I am good.')))
print(tokenizer.decode(tokenizer.encode('I am good.')))

This produces

['<s>', 'I', 'Ġam', 'Ġgood', '.', '</s>']
<s>I am good.</s>
['<s>', 'I', 'Ġam', 'Ġgood', '.', '</s>']
<s>I am good.</s>
['<s>', '▁I', '▁am', '▁good', '.', '</s>']  # note ▁I instead of I
<s> I am good.</s>  # note that here is a space between <s> and I
['<s>', '▁I', '▁am', '▁good', '.', '</s>']  # note ▁I instead of I
<s> I am good.</s>  # note that here is a space between <s> and I

Expected behavior

Decoding the encoded sentence should give back the original text, apart from the <s> and </s> special tokens, as shown in XLM's GitHub README:

['<s>', 'I', 'Ġam', 'Ġgood', '.', '</s>']
<s>I am good.</s>
['<s>', 'I', 'Ġam', 'Ġgood', '.', '</s>']
<s>I am good.</s>
['<s>', 'I', '▁am', '▁good', '.', '</s>']  # I instead of ▁I
<s>I am good.</s>  # no space before I
['<s>', 'I', '▁am', '▁good', '.', '</s>']  # I instead of ▁I
<s>I am good.</s>  # no space before I
LysandreJik commented 3 years ago

Maybe of interest to @SaulLu

SaulLu commented 3 years ago

Thank you for your detailed issue.

I just tested the original XLM-R tokenizer, and it seems to me that our tokenization matches the one in the repository you mention.

Indeed, by running the following (see the Google Colaboratory notebook):

import torch

# load the original fairseq XLM-R large model (includes its SentencePiece tokenizer)
xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.large')
xlmr.eval()

tokens = xlmr.encode('I am good I am goodI am good.')

# decode each id individually to see which token it corresponds to
print([(xlmr.decode(torch.tensor([token])), token.item()) for token in tokens])

We get:

[('', 0),
 ('I', 87),
 ('am', 444),
 ('good', 4127),
 ('I', 87),
 ('am', 444),
 ('good', 4127),
 ('I', 568),
 ('am', 444),
 ('good', 4127),
 ('.', 5),
 ('', 2)]

The last line of the snippet shows which token is associated with which id. In particular, the "I" at the beginning of the sentence and the one in the middle share the same id, whereas the final "I", glued to the preceding word, gets a different id.

In the HF framework this means that id=87 -> '▁I' and id=568 -> 'I'. I would therefore tend to agree with the current output of tokenizer.convert_ids_to_tokens(tokenizer.encode('I am good.')).
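
As a quick check on the HF side (a minimal sketch; the ids are taken from the fairseq output above, and the commented result is what I expect):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-large')
# id 87 should be the word-initial piece, id 568 the continuation piece
print(tokenizer.convert_ids_to_tokens([87, 568]))  # expected: ['▁I', 'I']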

Does this answer your question? Am I missing something? :slightly_smiling_face:

ciaochiaociao commented 3 years ago

Thank you for your prompt reply and the clear example.

  1. I agree with you that tokenizer.convert_ids_to_tokens(tokenizer.encode('I am good.')) works the same way as xlmr does.
  2. What about tokenizer.decode(tokenizer.encode('I am good.'))? It currently gives <s> I am good.</s> (note the space between <s> and I). Shouldn't it be <s>I am good.</s>?

I know that ▁ means a leading space in SentencePiece, which XLM-Roberta uses, as described in https://github.com/google/sentencepiece#what-is-sentencepiece. In addition, when a sub-word is at the beginning of the sentence it is also prefixed with ▁ even though it has no leading space, which makes sense because it distinguishes a sub-word piece that continues the previous one from one that starts at a word boundary. But when decoding the encoded string back, SentencePiece drops that leading space, while HF's tokenizer does not.
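
To illustrate what I mean (a minimal sketch; the commented outputs are what I expect, assuming convert_tokens_to_string reflects the plain SentencePiece detokenization):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-large')
tokens = tokenizer.tokenize('I am good.')             # ['▁I', '▁am', '▁good', '.']
# plain detokenization drops the leading space marker ...
print(tokenizer.convert_tokens_to_string(tokens))     # expected: 'I am good.'
# ... while decode, with <s> prepended, turns it into a real space
print(tokenizer.decode(tokenizer.encode('I am good.')))  # '<s> I am good.</s>' (as in the reproduction above)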

The inserted <s> special token also adds a token before the whole sentence, so the original first sub-word is no longer the first one. Maybe special tokens should be ignored when determining which sub-word is the first. I know SentencePiece itself does not add the special token <s>; that is done in xlmr. Also, xlmr does not decode id=0 to <s> and id=2 to </s> but decodes both to the empty string, so xlmr.decode(xlmr.encode('I am good.')) actually outputs simply I am good.
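
On the HF side, skipping the special tokens during decoding appears to give a similar result (a minimal sketch; the commented output is what I would expect, not something I have verified in every version):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-large')
ids = tokenizer.encode('I am good.')
# with <s>/</s> removed, '▁I' is the first piece again, so its leading
# space marker should not be turned into a real space
print(tokenizer.decode(ids, skip_special_tokens=True))  # expected: 'I am good.'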

I know this is just a very minor issue. I am only using the decode method and also trying to do some offset calculation. Maybe I should not rely on decode but on the offset mapping instead.
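
Something along these lines is probably what I should use instead (a sketch with the fast tokenizer's return_offsets_mapping option; the example string is just for illustration):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-large', use_fast=True)
text = 'I am good.'
enc = tokenizer(text, return_offsets_mapping=True)
# each entry is a (start, end) character span into the original text;
# special tokens such as <s> and </s> get (0, 0)
for tok, (start, end) in zip(tokenizer.convert_ids_to_tokens(enc['input_ids']), enc['offset_mapping']):
    print(tok, (start, end), repr(text[start:end]))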

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.