Open KanTakahiro opened 1 month ago
Hey! Sorry for the delay, would you like to open a PR for a fix? As long as the outputs are unaffected this would be nice indeed!
Hello! I have just opened a PR for this issue. Please check it and tell me if there is anything I need to adjust or improve.
System Info
transformers version: 4.44.0
Who can help?
@ArthurZucker
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
In line 383 of src/transformers/models/bert_japanese/tokenization_bert_japanese.py, the default dictionary is set to ipadic, so I have to install ipadic-py. But ipadic-py's GitHub page says "You Shouldn't Use This" and recommends using UniDic. However, even though I installed only unidic-lite, transformers still requires ipadic. I had to modify the transformers source code to use unidic-lite. I changed line 383 of src/transformers/models/bert_japanese/tokenization_bert_japanese.py:
I think the official version should also be updated to use unidic-lite for modern Japanese tokenization.
My script:
Expected behavior
Change the default dictionary for Japanese tokenization from ipadic to unidic-lite.
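For anyone hitting the same problem before the default changes, here is a rough sketch of choosing the dictionary from whatever is actually installed, instead of editing the library source. The helper names pick_mecab_dic and build_mecab_kwargs are hypothetical (they are not part of transformers), and the preference order is my own assumption. The module names unidic_lite, unidic and ipadic are the import names of the corresponding pip packages.

```python
from importlib.util import find_spec
from typing import Callable, Dict, Optional, Sequence


def _is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return find_spec(module_name) is not None


def pick_mecab_dic(
    preferred: Sequence[str] = ("unidic_lite", "unidic", "ipadic"),
    is_installed: Callable[[str], bool] = _is_installed,
) -> Optional[str]:
    """Hypothetical helper: return the first installed dictionary in `preferred`.

    The default order (UniDic variants first, ipadic last) is an assumption
    that follows the recommendation quoted in this issue.
    """
    for name in preferred:
        if is_installed(name):
            return name
    return None


def build_mecab_kwargs(
    preferred: Sequence[str] = ("unidic_lite", "unidic", "ipadic"),
    is_installed: Callable[[str], bool] = _is_installed,
) -> Dict[str, str]:
    """Hypothetical helper: build a kwargs dict naming the chosen dictionary."""
    dic = pick_mecab_dic(preferred, is_installed)
    if dic is None:
        raise ImportError(
            "No MeCab dictionary found; install one of: " + ", ".join(preferred)
        )
    return {"mecab_dic": dic}


if __name__ == "__main__":
    # Prints the dictionary name found in the current environment, or None.
    print(pick_mecab_dic())
```

As far as I can tell, the resulting dict can be passed as mecab_kwargs (together with word_tokenizer_type="mecab") when creating BertJapaneseTokenizer, which should select unidic-lite without touching line 383. Please check this against the installed transformers version, since I am assuming the keyword names here.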