huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Use unidic-lite instead of ipadic for Japanese tokenization #32482

Open KanTakahiro opened 1 month ago

KanTakahiro commented 1 month ago

System Info

Who can help?

@ArthurZucker

Information

Tasks

Reproduction

In line 383 of src/transformers/models/bert_japanese/tokenization_bert_japanese.py, the default dictionary is set to ipadic, so I have to install ipadic-py. However, ipadic-py's GitHub page says "You Shouldn't Use This" and recommends using UniDic instead. Even though I installed only unidic-lite, transformers still requires ipadic, so I had to modify the transformers source code to use unidic-lite. I changed line 383 of src/transformers/models/bert_japanese/tokenization_bert_japanese.py:

- mecab_dic: Optional[str] = "ipadic",
+ mecab_dic: Optional[str] = "unidic-lite",

I think the official version should also be updated to use unidic-lite for modern Japanese tokenization.
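Until the default changes, the dictionary can probably be selected at load time instead of patching the library. The following is a minimal sketch, not a confirmed fix. It assumes that `BertJapaneseTokenizer` accepts `word_tokenizer_type` and `mecab_kwargs`, and that the dictionary key is spelled `unidic_lite` (with an underscore, like the Python module name, rather than the hyphenated pip package name); both should be checked against the installed transformers version. The helper names `japanese_tokenizer_kwargs` and `load_tokenizer` are hypothetical.

```python
def japanese_tokenizer_kwargs(mecab_dic: str = "unidic_lite") -> dict:
    """Build keyword arguments that select a MeCab dictionary at load time.

    Assumption: the tokenizer forwards ``mecab_kwargs`` to its MeCab word
    tokenizer, and "unidic_lite" is the accepted key for unidic-lite.
    """
    return {
        "word_tokenizer_type": "mecab",
        "mecab_kwargs": {"mecab_dic": mecab_dic},
    }


def load_tokenizer(model_name: str, mecab_dic: str = "unidic_lite"):
    """Load a BertJapaneseTokenizer with the requested MeCab dictionary."""
    # Imported lazily so the helper above can be used without transformers.
    from transformers import BertJapaneseTokenizer

    return BertJapaneseTokenizer.from_pretrained(
        model_name, **japanese_tokenizer_kwargs(mecab_dic)
    )
```

Calling `load_tokenizer(MODEL_NAME)` would then replace the plain `from_pretrained` call in the script. Note that a checkpoint trained with ipadic segmentation may produce different tokens when loaded with another dictionary.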

My script:

import random
import glob
from tqdm import tqdm

import torch
from torch.utils.data import DataLoader
from transformers import BertJapaneseTokenizer, BertForSequenceClassification
import pytorch_lightning as pl

MODEL_NAME = 'cl-tohoku/bert-base-japanese-whole-word-masking'

tokenizer = BertJapaneseTokenizer.from_pretrained(MODEL_NAME)
bert_sc = BertForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=10
)
# bert_sc = bert_sc.cuda()


Expected behavior

Change the default dictionary for Japanese tokenization from ipadic to unidic-lite.

ArthurZucker commented 2 weeks ago

Hey! Sorry for the delay, would you like to open a PR for a fix? As long as the outputs are unaffected this would be nice indeed!
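One way to check whether outputs are unaffected is to tokenize a set of sample sentences with both dictionaries and list the sentences where the results differ. The sketch below is only an illustration of that comparison: `find_tokenization_diffs` is a hypothetical helper, and the two callables are assumed to be something like the `tokenize` methods of tokenizers loaded with ipadic and with unidic-lite.

```python
from typing import Callable, Iterable, List, Tuple

# Any function that maps a sentence to a list of tokens.
Tokenize = Callable[[str], List[str]]


def find_tokenization_diffs(
    tokenize_old: Tokenize,
    tokenize_new: Tokenize,
    sentences: Iterable[str],
) -> List[Tuple[str, List[str], List[str]]]:
    """Return (sentence, old_tokens, new_tokens) for every sentence
    whose tokenization differs between the two tokenizers."""
    diffs = []
    for sentence in sentences:
        old_tokens = tokenize_old(sentence)
        new_tokens = tokenize_new(sentence)
        if old_tokens != new_tokens:
            diffs.append((sentence, old_tokens, new_tokens))
    return diffs
```

An empty result on a representative corpus would suggest the change is safe for that data; any entries show exactly where the segmentation changed.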

KanTakahiro commented 6 days ago

Hello! I have just opened a PR for this issue. Please check it and tell me if there is anything I need to adjust or improve.