retarfi / language-pretraining

Pre-training Language Models for Japanese
MIT License

ipadic problem for 四半期連結会計期間末日満期手形 #1

Open · KoichiYasuoka opened this issue 3 years ago

KoichiYasuoka commented 3 years ago

Thank you for releasing bert-small-japanese-fin and the other ELECTRA models for FinTech. However, I've found that they tokenize "四半期連結会計期間末日満期手形" badly:

>>> from transformers import AutoTokenizer
>>> tokenizer=AutoTokenizer.from_pretrained("izumi-lab/bert-small-japanese-fin")
>>> tokenizer.tokenize("四半期連結会計期間末日満期手形")
['四半期', '連結', '会計', '期間', '末日', '満期', '手形']
>>> tokenizer.tokenize("第3四半期連結会計期間末日満期手形")
['第', '3', '四半期連結会計期間末日満期手形']

This is caused by a bug in ipadic's 名詞,数 (numeral noun) tokenization of kanji strings that begin with a kanji numeral (漢数字).

>>> import fugashi,ipadic
>>> parser=fugashi.GenericTagger(ipadic.MECAB_ARGS).parse
>>> print(parser("四半期連結会計期間末日満期手形"))
四半期  名詞,一般,*,*,*,*,四半期,シハンキ,シハンキ
連結    名詞,サ変接続,*,*,*,*,連結,レンケツ,レンケツ
会計    名詞,サ変接続,*,*,*,*,会計,カイケイ,カイケイ
期間    名詞,一般,*,*,*,*,期間,キカン,キカン
末日    名詞,一般,*,*,*,*,末日,マツジツ,マツジツ
満期    名詞,一般,*,*,*,*,満期,マンキ,マンキ
手形    名詞,一般,*,*,*,*,手形,テガタ,テガタ
EOS
>>> print(parser("第3四半期連結会計期間末日満期手形"))
第      接頭詞,数接続,*,*,*,*,第,ダイ,ダイ
3       名詞,数,*,*,*,*,*
四半期連結会計期間末日満期手形  名詞,数,*,*,*,*,*
EOS

I recommend using a tokenizer other than BertJapaneseTokenizer+ipadic. See the details in my diary (written in Japanese).
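
As a stopgap, one could switch only the MeCab dictionary while keeping the released vocabulary. This is a minimal sketch, assuming the mecab_kwargs / mecab_dic options of BertJapaneseTokenizer and the unidic_lite package are available; note that the WordPiece vocabulary was still built with ipadic, so this does not fully fix the problem:

from transformers import BertJapaneseTokenizer

# Sketch: reuse the released vocabulary but let MeCab use unidic_lite instead of ipadic.
# The mecab_dic option is an assumption about the installed transformers version.
tokenizer = BertJapaneseTokenizer.from_pretrained(
    "izumi-lab/bert-small-japanese-fin",
    word_tokenizer_type="mecab",
    mecab_kwargs={"mecab_dic": "unidic_lite"},
)
print(tokenizer.tokenize("第3四半期連結会計期間末日満期手形"))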

retarfi commented 3 years ago

Thank you for your comment and for sharing the issue. I had not noticed this ipadic problem. Not only tokenization but also vocab.txt (the vocabulary-building process) would be affected: the vocabulary wrongly contains such a long word, which might otherwise have been tokenized into words such as '四半期', '連結', '会計', '期間', '末日', '満期', '手形'. Is this problem unique to ipadic? If so, one solution would be to change the dictionary from ipadic to unidic_lite or unidic, and we would need to pre-train our model again with that dictionary.
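
To illustrate the vocab.txt concern, a quick check of whether the compound was registered as a single vocabulary entry might look like this (a sketch; the tokenization output above already suggests it was):

from transformers import AutoTokenizer

# Sketch: check whether the long compound ended up as one WordPiece vocabulary entry.
tokenizer = AutoTokenizer.from_pretrained("izumi-lab/bert-small-japanese-fin")
vocab = tokenizer.get_vocab()
for word in ["四半期連結会計期間末日満期手形", "四半期", "連結", "手形"]:
    print(word, word in vocab)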

KoichiYasuoka commented 3 years ago

Is this problem unique for ipadic?

Maybe. At least unidic_lite does not tokenize them that way:

>>> import fugashi,unidic_lite
>>> parser=fugashi.GenericTagger("-d "+unidic_lite.DICDIR).parse
>>> print(parser("四半期連結会計期間末日満期手形"))
四半    シハン  シハン  四半    名詞-普通名詞-一般                      0,2
期      キ      キ      期      名詞-普通名詞-助数詞可能                       1
連結    レンケツ        レンケツ        連結    名詞-普通名詞-サ変可能         0
会計    カイケー        カイケイ        会計    名詞-普通名詞-サ変可能         0
期間    キカン  キカン  期間    名詞-普通名詞-一般                      1,2
末日    マツジツ        マツジツ        末日    名詞-普通名詞-一般             0
満期    マンキ  マンキ  満期    名詞-普通名詞-一般                      0,1
手形    テガタ  テガタ  手形    名詞-普通名詞-一般                      0
EOS
>>> print(parser("第3四半期連結会計期間末日満期手形"))
第      ダイ    ダイ    第      接頭辞
3       3       3       3       名詞-数詞                       0
四半    シハン  シハン  四半    名詞-普通名詞-一般                      0,2
期      キ      キ      期      名詞-普通名詞-助数詞可能                       1
連結    レンケツ        レンケツ        連結    名詞-普通名詞-サ変可能         0
会計    カイケー        カイケイ        会計    名詞-普通名詞-サ変可能         0
期間    キカン  キカン  期間    名詞-普通名詞-一般                      1,2
末日    マツジツ        マツジツ        末日    名詞-普通名詞-一般             0
満期    マンキ  マンキ  満期    名詞-普通名詞-一般                      0,1
手形    テガタ  テガタ  手形    名詞-普通名詞-一般                      0
EOS

However, unidic_lite (or unidic) is based on 国語研短単位 (NINJAL Short Unit Words), which is a rather short word unit for this purpose. I think a longer unit, such as 国語研長単位 (NINJAL Long Unit Words), is more suitable for FinTech. Would you try making your own tokenizer?

retarfi commented 3 years ago

As you mentioned, subword tokenization built on long units such as 長単位 seems better than using ipadic or unidic(_lite). Creating such a tokenizer would be the better option, but it is difficult with my current resources...

KoichiYasuoka commented 3 years ago

Hi @retarfi, I've just released Japanese-LUW-Tokenizer. It took about 20 hours to build the tokenizer from a 700 MB orig.txt (one UTF-8 sentence per line) on one GPU (NVIDIA GeForce RTX 2080):

import unicodedata
from tokenizers import CharBPETokenizer
from transformers import AutoModelForTokenClassification,AutoTokenizer,TokenClassificationPipeline,RemBertTokenizerFast

# Step 1: segment the raw corpus into Long Unit Words (LUW) with a token-classification model
brt="KoichiYasuoka/bert-base-japanese-luw-upos"
mdl=AutoModelForTokenClassification.from_pretrained(brt)
tkz=AutoTokenizer.from_pretrained(brt)
nlp=TokenClassificationPipeline(model=mdl,tokenizer=tkz,aggregation_strategy="simple",device=0)
with open("orig.txt","r",encoding="utf-8") as f, open("luw.txt","w",encoding="utf-8") as w:
  d=[]
  for r in f:
    if r.strip()!="":
      d.append(r.strip())
    if len(d)>255:
      # process sentences in batches of 256, writing space-separated LUWs per line
      for s in nlp(d):
        print(" ".join(t["word"] for t in s),file=w)
      d=[]
  if len(d)>0:
    for s in nlp(d):
      print(" ".join(t["word"] for t in s),file=w)

# Step 2: train a character-level BPE tokenizer on the LUW-segmented corpus,
# seeding the alphabet with the single CJK characters from the original vocabulary
alp=[c for c in tkz.convert_ids_to_tokens(list(range(len(tkz)))) if len(c)==1 and unicodedata.name(c).startswith("CJK")]
pst=tkz.backend_tokenizer.post_processor
tkz=CharBPETokenizer(lowercase=False,unk_token="[UNK]",suffix="")
tkz.normalizer.handle_chinese_chars=False
tkz.post_processor=pst
tkz.train(files=["luw.txt"],vocab_size=250300,min_frequency=2,limit_alphabet=20000,initial_alphabet=alp,special_tokens=["[PAD]","[UNK]","[CLS]","[SEP]","[MASK]","<special0>","<special1>","<special2>","<special3>","<special4>","<special5>","<special6>","<special7>","<special8>","<special9>"],suffix="")
tkz.save("tokenizer.json")

# Step 3: wrap the trained tokenizer so it can be saved and reloaded with transformers
tokenizer=RemBertTokenizerFast(tokenizer_file="tokenizer.json",vocab_file="/dev/null",bos_token="[CLS]",cls_token="[CLS]",unk_token="[UNK]",pad_token="[PAD]",mask_token="[MASK]",sep_token="[SEP]",do_lower_case=False,keep_accents=True)
tokenizer.save_pretrained("Japanese-LUW-Tokenizer")

vocab_size=250300 seems too big, but it is acceptable. See the details in my diary (written in Japanese).
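
For reference, a minimal usage sketch of the released tokenizer (assuming it is published on the Hugging Face Hub as KoichiYasuoka/Japanese-LUW-Tokenizer; adjust the model ID if it is hosted elsewhere):

from transformers import AutoTokenizer

# Sketch: load the released LUW-based tokenizer and try the problematic strings.
tokenizer = AutoTokenizer.from_pretrained("KoichiYasuoka/Japanese-LUW-Tokenizer")
print(tokenizer.tokenize("四半期連結会計期間末日満期手形"))
print(tokenizer.tokenize("第3四半期連結会計期間末日満期手形"))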

retarfi commented 2 years ago

Thank you for sharing! I will check it in detail.