huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Wrong offsets mapping in XLMRobertaTokenizerFast #9744

Closed · nikitakit closed this issue 3 years ago

nikitakit commented 3 years ago

Environment info

Who can help

@mfuntowicz @thomwolf

Information

Model I am using (Bert, XLNet ...): XLMRobertaTokenizerFast

The problem arises when using:

The task I am working on is:

To reproduce

Steps to reproduce the behavior:

from transformers import XLMRobertaTokenizerFast

tokenizer = XLMRobertaTokenizerFast.from_pretrained('xlm-roberta-large')
text = "?……"
tokenized = tokenizer(text, return_offsets_mapping=True)
print('Text:', text)
print('Tokens:', tokenizer.convert_ids_to_tokens(tokenized.input_ids))
# Slice the original text with each token's (start, end) character offsets
print('Mapped to:', [text[start:end] for start, end in tokenized.offset_mapping])

Observed behavior:

Text: ?……
Tokens: ['<s>', '▁?', '......', '</s>']
Mapped to: ['', '?', '?', '']

Expected behavior:

Text: ?……
Tokens: ['<s>', '▁?', '......', '</s>']
Mapped to: ['', '?', '……', '']

Expected behavior

I'm using XLM-R for Chinese text, and I would expect offset mappings to work correctly even in the presence of various Unicode punctuation symbols. It looks like XLM-R tokenizes "?……" as two tokens ('▁?' and '......'), each of which should map back to its own location in the input. Instead, both tokens receive the identical offset span, covering only the '?'.

This example is the ending of an actual sentence in the Chinese Treebank; I removed the rest of the sentence because it isn't needed to reproduce the bug.
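
For what it's worth, here is a minimal sketch of the invariant I'd expect to hold (my own hypothetical check, not anything from the library): non-special tokens should map to non-empty spans that strictly advance through the original text. On the affected versions, the second assertion fails because both content tokens map to the same span (0, 1):

from transformers import XLMRobertaTokenizerFast

tokenizer = XLMRobertaTokenizerFast.from_pretrained('xlm-roberta-large')

def check_offsets(text):
    enc = tokenizer(text, return_offsets_mapping=True)
    prev_end = 0
    for token_id, (start, end) in zip(enc.input_ids, enc.offset_mapping):
        if token_id in tokenizer.all_special_ids:
            continue  # special tokens like <s> legitimately map to (0, 0)
        # Each content token should cover a non-empty span...
        assert start < end, (token_id, start, end)
        # ...and spans should not overlap or go backwards
        assert start >= prev_end, (token_id, start, end)
        prev_end = end

check_offsets("?……")  # AssertionError on the affected versions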

n1t0 commented 3 years ago

The wrong alignments are caused by the Precompiled normalizer in tokenizers. This will be fixed in version 0.10.1 of the library.
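
For readers hitting this: the normalization step can be inspected directly on the fast tokenizer's backend. A minimal sketch (assuming the backend exposes its normalizer, as XLMRobertaTokenizerFast does) of why the offsets collapse: the Precompiled normalizer rewrites the two-character "……" as six ASCII dots, and the affected versions tracked offsets against that normalized string rather than the original input.

from transformers import XLMRobertaTokenizerFast

tokenizer = XLMRobertaTokenizerFast.from_pretrained('xlm-roberta-large')
# backend_tokenizer is the underlying Rust tokenizer from the tokenizers library
normalizer = tokenizer.backend_tokenizer.normalizer
# "……" (2 characters) becomes "......" (6 characters), so positions in the
# normalized string no longer line up with positions in the original input
print(normalizer.normalize_str("?……"))  # expected: ?......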

n1t0 commented 3 years ago

This is fixed in any version of transformers >= 4.3.0.
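
A quick way to confirm on an upgraded install (a sketch assuming transformers >= 4.3.0, which pulls in a tokenizers release containing the Precompiled fix): re-run the repro and check that the two content tokens now map back to distinct spans.

from transformers import XLMRobertaTokenizerFast

tokenizer = XLMRobertaTokenizerFast.from_pretrained('xlm-roberta-large')
text = "?……"
tokenized = tokenizer(text, return_offsets_mapping=True)
spans = [text[start:end] for start, end in tokenized.offset_mapping]
# Expected with the fix: special tokens map to empty spans, content tokens
# map back to their own characters in the original text
assert spans == ['', '?', '……', '']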