Closed cosmeowpawlitan closed 3 years ago
Hi @jerryIsHere, thanks for reporting the issue. But are you sure this is a bug in HuggingFace Datasets?
Oh, I am sorry. I will reopen the post on huggingface/transformers.
This colab notebook implements a token classification input pipeline extending the logic from this Hugging Face example.

The pipeline works fine with most instances in different languages, but unfortunately the Japanese Kana ligature (a form of abbreviation? I don't know Japanese well) breaks the alignment of `return_offsets_mapping`. Without the try/except block, it raises:

`ValueError: NumPy boolean array indexing assignment cannot assign 88 input values to the 87 output values where the mask is true`

An example is shown here (another colab notebook). It is clear that the normalizer is the step that breaks the alignment, as can be observed from the fact that `tokenizer._tokenizer.normalizer.normalize_str('ヿ')` returns `'コト'`.
One workaround is to apply `tokenizer._tokenizer.normalizer.normalize_str` before the tokenizer preprocessing pipeline; this is also provided in the first colab notebook under the name `udposTestDatasetWorkaround`.

I guess similar logic should be included inside the tokenizer and the offsets_mapping generation process, so that users don't need to include it in their own code. But I don't understand the tokenizer code well enough to do this myself.
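A minimal sketch of the workaround idea: normalize each word before handing it to the tokenizer, so that the offsets it returns refer to text it will not change again. Here `unicodedata.normalize("NFKC", ...)` stands in for `tokenizer._tokenizer.normalizer.normalize_str` (an assumption; the real normalizer may differ on other inputs), and the function name is hypothetical:

```python
import unicodedata

def pre_normalize(words):
    # Hypothetical helper: apply the normalization up front so the
    # character offsets computed by the fast tokenizer line up with
    # the text the model actually sees. Stand-in for calling
    # tokenizer._tokenizer.normalizer.normalize_str on each word.
    return [unicodedata.normalize("NFKC", w) for w in words]

words = ["これ", "は", "ヿ", "です"]
print(pre_normalize(words))  # ['これ', 'は', 'コト', 'です']
```

With the words pre-normalized, the labels can be aligned to `return_offsets_mapping` without the lengths drifting apart.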
P.S. I am using my own dataset building script in the provided example, but the script should be equivalent to the changes made by this update. `get_dataset` is just a simple wrapper for `load_dataset`, and the `tokenizer` is just `XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-large")`.