Closed cosmeowpawlitan closed 3 years ago
Hi @jerryIsHere, thanks for reporting the issue. But are you sure this is a bug in HuggingFace Datasets?
Oh, I am sorry. I will reopen the post on huggingface/transformers.
This colab notebook implements a token classification input pipeline extending the logic from this Hugging Face example.

The pipeline works fine with most instances in different languages, but unfortunately the Japanese Kana ligature (a form of abbreviation? I don't know Japanese well) breaks the alignment of `return_offsets_mapping`. Without the try/except block, it raises:

`ValueError: NumPy boolean array indexing assignment cannot assign 88 input values to the 87 output values where the mask is true`

An example is shown here (another colab notebook). It is clear that the normalizer is the step that breaks the alignment, as can be observed from the fact that `tokenizer._tokenizer.normalizer.normalize_str('ヿ')` returns `'コト'`.
One workaround is to apply `tokenizer._tokenizer.normalizer.normalize_str` before the tokenizer preprocessing pipeline; this is also provided in the first colab notebook under the name `udposTestDatasetWorkaround`.

I guess similar logic should be included inside the tokenizer and the offsets_mapping generation process, so that users don't need to include it in their own code. But I don't understand the tokenizer code well enough to do this myself.
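A minimal sketch of the workaround idea: normalize each word before handing it to the tokenizer, so that the offsets it returns refer to text it will not change again. Here `unicodedata.normalize("NFKC", ...)` stands in for `tokenizer._tokenizer.normalizer.normalize_str` (an assumption; the real normalizer may differ on other inputs), and the function name is hypothetical:

```python
import unicodedata

def pre_normalize(words):
    # Hypothetical helper: apply the normalization up front so the
    # character offsets computed by the fast tokenizer line up with
    # the text the model actually sees. Stand-in for calling
    # tokenizer._tokenizer.normalizer.normalize_str on each word.
    return [unicodedata.normalize("NFKC", w) for w in words]

words = ["これ", "は", "ヿ", "です"]
print(pre_normalize(words))  # ['これ', 'は', 'コト', 'です']
```

With the words pre-normalized, the labels can be aligned to `return_offsets_mapping` without the lengths drifting apart.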
P.S. I am using my own dataset building script in the provided example, but the script should be equivalent to the changes made by this update. `get_dataset` is just a simple wrapper for `load_dataset`, and the `tokenizer` is just `XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-large")`.