huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

`XLMRobertaTokenizer` `encode_plus` api producing `<unk>` for a valid token #10877

Closed abgoswam closed 3 years ago

abgoswam commented 3 years ago

Environment info

Who can help

@LysandreJik

Information

XLMRobertaTokenizer encode_plus api producing <unk> for a valid token

To reproduce

from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

text = "请在黄鹂餐厅预订今晚7点半的位置。"
toks = tokenizer.tokenize(text)
assert toks == ['▁', '请', '在', '黄', '鹂', '餐厅', '预订', '今晚', '7', '点', '半', '的位置', '。']

output = tokenizer.encode_plus(text, add_special_tokens=False)
toks_converted = tokenizer.convert_ids_to_tokens(output['input_ids'])

assert toks_converted == ['▁', '请', '在', '黄', '<unk>', '餐厅', '预订', '今晚', '7', '点', '半', '的位置', '。']

Expected behavior

assert toks_converted[4] == '鹂'  # not <unk>
LysandreJik commented 3 years ago

Hi, thanks for opening an issue! Seen with @n1to, and this comes from the Unigram-based tokenizers: Unigram's tokenize step cuts the string into tokens and then converts them to IDs. Unknown tokens are detected during the token-to-ID conversion, not when the string is cut into tokens.

This is different from BPE, where the string is cut into many independent characters, converted to IDs, and then merged together.

This is also different from WordPiece, where we start from the word and cut it until we find a token representation for each word piece; if we can't, that piece is unknown.
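The Unigram behaviour described above can be illustrated with a minimal toy sketch (not the real implementation; the vocabulary, IDs, and function names below are invented for illustration). The point is that cutting the string never produces `<unk>`; only the piece-to-ID lookup does:

```python
# Toy sketch of where <unk> appears in a Unigram-style tokenizer.
# The vocabulary and IDs here are hypothetical.
UNK_ID = 3
VOCAB = {"▁": 5, "请": 10, "黄": 11, "餐厅": 12}  # note: "鹂" is missing

def unigram_tokenize(pieces):
    # Cutting the string never yields <unk>; every piece is kept as-is.
    return list(pieces)

def pieces_to_ids(pieces):
    # Unknowns are only detected here, during token -> ID conversion.
    return [VOCAB.get(p, UNK_ID) for p in pieces]

def ids_to_pieces(ids):
    inv = {v: k for k, v in VOCAB.items()}
    return [inv.get(i, "<unk>") for i in ids]

pieces = unigram_tokenize(["▁", "请", "黄", "鹂", "餐厅"])
ids = pieces_to_ids(pieces)
print(ids_to_pieces(ids))  # ['▁', '请', '黄', '<unk>', '餐厅']
```

So `tokenize` happily returns `'鹂'` as a piece, and the information is only lost once IDs are involved, which matches the behaviour reported above.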

abgoswam commented 3 years ago

Hi @LysandreJik, thanks for looking into this and sharing the info.

Based on your response, it seems that for the XLMRobertaTokenizer we cannot guarantee that the following holds:

assert tokenizer.decode(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))) == text

Am I right?

LysandreJik commented 3 years ago

I believe that's true for any tokenizer. If the tokenizer cannot tokenize one part of your text as it is not part of your vocabulary, then some information is lost.
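The information loss can be made concrete with a small hedged sketch (toy vocabulary and IDs, not any real tokenizer): once a piece maps to the unknown ID, decoding cannot recover the original character, so the round-trip invariant fails for any text containing out-of-vocabulary pieces.

```python
# Toy illustration: decode(encode(text)) != text once <unk> is involved.
# The vocabulary here is hypothetical.
UNK_ID = 0
VOCAB = {"黄": 1, "餐厅": 2}
INV = {v: k for k, v in VOCAB.items()}

def encode(pieces):
    return [VOCAB.get(p, UNK_ID) for p in pieces]

def decode(ids):
    return "".join(INV.get(i, "<unk>") for i in ids)

# "鹂" is out of vocabulary, so it is irreversibly replaced by <unk>.
print(decode(encode(["黄", "鹂", "餐厅"])))  # 黄<unk>餐厅
```

With an in-vocabulary input the round trip is lossless; the guarantee only breaks on unknown pieces.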

stefan-it commented 3 years ago

Hey guys,

for the sake of completeness, here's the double check with the reference implementation/tokenizer:

import torch
xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.base')
xlmr.eval()

tokens = xlmr.encode('请在黄鹂餐厅预订今晚7点半的位置。')

It outputs:

tensor([     0,      6,   9736,    213,  19390,      3, 113638, 209093, 155755,
           966,   2391,   6193,  57486,     30,      2])

3 is the ID of the unknown token, but you can "reverse" the tokenization with:

xlmr.decode(tokens)

This outputs:

'请在黄<unk>餐厅预订今晚7点半的位置。'

So the <unk> token also appears :)

abgoswam commented 3 years ago

@LysandreJik, agreed that for any tokenizer some information loss can happen if a token is not part of the vocab.

I guess the SentencePiece (Unigram) tokenizer is unique in the sense that:

Because of this, in the tokenize API of XLMRobertaTokenizer, there is no <unk> when the string is being cut into tokens.

But in the encode API, when the tokens are converted to IDs, <unk> tokens can appear, as @stefan-it confirmed.

https://github.com/google/sentencepiece/blob/9cf136582d9cce492ba5a0cfb775f9e777fe07ea/src/unigram_model.cc#L433

abgoswam commented 3 years ago

Thanks, folks, for the discussion and insight into the behaviour of tokenizers in HF.

Closing this issue, since it's not a bug per se.

hapazv commented 3 years ago

Hi guys,

I tried to reproduce the code at the beginning of this topic and I get the following: token roberta