I'm using XLM-R for Chinese text, and I would expect offset mappings to work correctly even in the presence of various Unicode punctuation symbols. XLM-R appears to tokenize "?……" as two tokens ('▁?' and '......'), which I would expect to map back to their respective locations in the input. Instead, the offset mappings for these two tokens are identical.
This example is the ending of an actual sentence in the Chinese Treebank; I removed the rest of the sentence because it isn't needed to reproduce the bug.
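One plausible reason the second token shows up as six ASCII periods is Unicode compatibility normalization. The sketch below uses only Python's standard library to show what NFKC does to this input. That XLM-R's SentencePiece model applies NFKC-style normalization is an assumption here, not something this report confirms.

```python
import unicodedata

# The input from this report: a question mark followed by two
# U+2026 HORIZONTAL ELLIPSIS characters.
text = "?……"

# NFKC expands each U+2026 into three ASCII periods.
# Assumption: the tokenizer's normalizer behaves similarly.
normalized = unicodedata.normalize("NFKC", text)

print(repr(normalized))            # '?......'
print(len(text), len(normalized))  # 3 7
```

If that assumption holds, the tokenizer has to map seven normalized characters back onto three original ones, which is where an offset bug could hide.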
Environment info
transformers version: 4.2.2
Who can help
@mfuntowicz @thomwolf
Information
Model I am using (Bert, XLNet ...): XLMRobertaTokenizerFast
The problem arises when using:
The task I am working on is:
To reproduce
Steps to reproduce the behavior:
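The original snippet is not preserved here, so the following is only a minimal sketch of how the problem could be checked. The checkpoint name `xlm-roberta-base` and both helper names (`encode_with_offsets`, `adjacent_duplicate_offsets`) are assumptions made for illustration; only `XLMRobertaTokenizerFast` comes from the report.

```python
from typing import List, Sequence, Tuple

Offset = Tuple[int, int]


def encode_with_offsets(text: str, model_name: str = "xlm-roberta-base"):
    """Return (tokens, offsets) for `text`.

    Hypothetical helper; `model_name` is an assumed checkpoint.
    transformers is imported lazily so the checker below runs without it.
    """
    from transformers import XLMRobertaTokenizerFast

    tokenizer = XLMRobertaTokenizerFast.from_pretrained(model_name)
    encoding = tokenizer(
        text,
        return_offsets_mapping=True,  # only supported by fast tokenizers
        add_special_tokens=False,     # skip <s> and </s>
    )
    tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
    offsets = [tuple(span) for span in encoding["offset_mapping"]]
    return tokens, offsets


def adjacent_duplicate_offsets(offsets: Sequence[Offset]) -> List[int]:
    """Return each index i where tokens i and i + 1 share one non-empty span."""
    return [
        i
        for i in range(len(offsets) - 1)
        if offsets[i] == offsets[i + 1] and offsets[i][0] != offsets[i][1]
    ]
```

Calling `tokens, offsets = encode_with_offsets("?……")` and then `adjacent_duplicate_offsets(offsets)` would flag the behavior described in this report if it returns a non-empty list. Empty spans are skipped on purpose so that special tokens, which conventionally map to `(0, 0)`, are not reported.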
Observed behavior:
Expected behavior:
Expected behavior
Each token's offset mapping should point to its own span of the input: '▁?' to the question mark, and '......' to the "……" that follows it, rather than both tokens reporting the same offsets.