huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Wrong offsets mapping in XLMRobertaTokenizerFast #9744

Closed · nikitakit closed this issue 3 years ago

nikitakit commented 3 years ago

Environment info

Who can help

@mfuntowicz @thomwolf

Information

Model I am using (Bert, XLNet ...): XLMRobertaTokenizerFast

The problem arises when using:

The task I am working on is:

To reproduce

Steps to reproduce the behavior:

from transformers import XLMRobertaTokenizerFast

tokenizer = XLMRobertaTokenizerFast.from_pretrained('xlm-roberta-large')
text = "?……"
tokenized = tokenizer(text, return_offsets_mapping=True)
print('Text:', text)
print('Tokens:', tokenizer.convert_ids_to_tokens(tokenized.input_ids))
# Slice the original text with each token's (start, end) character offsets
print('Mapped to:', [text[start:end] for start, end in tokenized.offset_mapping])

Observed behavior:

Text: ?……
Tokens: ['<s>', '▁?', '......', '</s>']
Mapped to: ['', '?', '?', '']

Expected behavior:

Text: ?……
Tokens: ['<s>', '▁?', '......', '</s>']
Mapped to: ['', '?', '……', '']

Expected behavior

I'm using XLM-R for Chinese text, and I would expect offset mappings to work correctly even in the presence of various Unicode punctuation symbols. It looks like XLM-R tokenizes "?……" as two tokens ('▁?' and '......'), each of which should map back to its own location in the input. Instead, both tokens receive the identical offset span, covering only the '?'.

This example is the ending of an actual sentence in the Chinese Treebank; I removed the rest of the sentence because it isn't needed to reproduce the bug.
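
For what it's worth, here is a minimal sketch of the invariant I'd expect to hold (my own hypothetical check, not anything from the library): non-special tokens should map to non-empty spans that strictly advance through the original text. On the affected versions, the second assertion fails because both content tokens map to the same span (0, 1):

from transformers import XLMRobertaTokenizerFast

tokenizer = XLMRobertaTokenizerFast.from_pretrained('xlm-roberta-large')

def check_offsets(text):
    enc = tokenizer(text, return_offsets_mapping=True)
    prev_end = 0
    for token_id, (start, end) in zip(enc.input_ids, enc.offset_mapping):
        if token_id in tokenizer.all_special_ids:
            continue  # special tokens like <s> legitimately map to (0, 0)
        # Each content token should cover a non-empty span...
        assert start < end, (token_id, start, end)
        # ...and spans should not overlap or go backwards
        assert start >= prev_end, (token_id, start, end)
        prev_end = end

check_offsets("?……")  # AssertionError on the affected versions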

n1t0 commented 3 years ago

The wrong alignments are caused by the Precompiled normalizer in tokenizers. This will be fixed in version 0.10.1 of the library.
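
For readers hitting this: the normalization step can be inspected directly on the fast tokenizer's backend. A minimal sketch (assuming the backend exposes its normalizer, as XLMRobertaTokenizerFast does) of why the offsets collapse: the Precompiled normalizer rewrites the two-character "……" as six ASCII dots, and the affected versions tracked offsets against that normalized string rather than the original input.

from transformers import XLMRobertaTokenizerFast

tokenizer = XLMRobertaTokenizerFast.from_pretrained('xlm-roberta-large')
# backend_tokenizer is the underlying Rust tokenizer from the tokenizers library
normalizer = tokenizer.backend_tokenizer.normalizer
# "……" (2 characters) becomes "......" (6 characters), so positions in the
# normalized string no longer line up with positions in the original input
print(normalizer.normalize_str("?……"))  # expected: ?......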

n1t0 commented 3 years ago

This is fixed in any version of transformers >= 4.3.0.
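
A quick way to confirm on an upgraded install (a sketch assuming transformers >= 4.3.0, which pulls in a tokenizers release containing the Precompiled fix): re-run the repro and check that the two content tokens now map back to distinct spans.

from transformers import XLMRobertaTokenizerFast

tokenizer = XLMRobertaTokenizerFast.from_pretrained('xlm-roberta-large')
text = "?……"
tokenized = tokenizer(text, return_offsets_mapping=True)
spans = [text[start:end] for start, end in tokenized.offset_mapping]
# Expected with the fix: special tokens map to empty spans, content tokens
# map back to their own characters in the original text
assert spans == ['', '?', '……', '']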