Open · KoichiYasuoka opened this issue 3 years ago
Thank you for your comment and for sharing the issue.
I had not noticed this ipadic issue. Not only tokenization but also vocab.txt (the vocabulary-building process) would have this problem: the vocabulary wrongly contains such a long word, whereas it should rather be tokenized into several words such as '四半期', '連結', '会計', '期間', '末日', '満期', '手形'.
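A quick way to verify this would be something like the following minimal sketch (the Hugging Face model ID izumi-lab/bert-small-japanese-fin is an assumption based on the model name mentioned in the issue):
from transformers import AutoTokenizer
# load the released tokenizer and check whether the long word leaked into vocab.txt
tkz=AutoTokenizer.from_pretrained("izumi-lab/bert-small-japanese-fin")
word="四半期連結会計期間末日満期手形"
print(word in tkz.get_vocab())  # True would confirm the vocabulary problem
print(tkz.tokenize(word))       # how the current tokenizer actually splits it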
Is this problem unique to ipadic? If so, one solution would be to change the dictionary from ipadic to unidic_lite or unidic, and then we would need to pre-train our model again with the new dictionary.
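For reference, switching only the word tokenizer's dictionary can be sketched as below (the model ID is assumed, and the WordPiece vocabulary would still be the old ipadic-based one, so re-pre-training would still be needed):
from transformers import BertJapaneseTokenizer
# mecab_dic may be "ipadic", "unidic_lite" or "unidic"
tkz=BertJapaneseTokenizer.from_pretrained(
    "izumi-lab/bert-small-japanese-fin",      # assumed model ID
    word_tokenizer_type="mecab",
    mecab_kwargs={"mecab_dic":"unidic_lite"},
)
print(tkz.tokenize("四半期連結会計期間末日満期手形"))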
"Is this problem unique to ipadic?"
Maybe. At least unidic_lite does not tokenize them in such a way:
>>> import fugashi,unidic_lite
>>> parser=fugashi.GenericTagger("-d "+unidic_lite.DICDIR).parse
>>> print(parser("四半期連結会計期間末日満期手形"))
四半 シハン シハン 四半 名詞-普通名詞-一般 0,2
期 キ キ 期 名詞-普通名詞-助数詞可能 1
連結 レンケツ レンケツ 連結 名詞-普通名詞-サ変可能 0
会計 カイケー カイケイ 会計 名詞-普通名詞-サ変可能 0
期間 キカン キカン 期間 名詞-普通名詞-一般 1,2
末日 マツジツ マツジツ 末日 名詞-普通名詞-一般 0
満期 マンキ マンキ 満期 名詞-普通名詞-一般 0,1
手形 テガタ テガタ 手形 名詞-普通名詞-一般 0
EOS
>>> print(parser("第3四半期連結会計期間末日満期手形"))
第 ダイ ダイ 第 接頭辞
3 3 3 3 名詞-数詞 0
四半 シハン シハン 四半 名詞-普通名詞-一般 0,2
期 キ キ 期 名詞-普通名詞-助数詞可能 1
連結 レンケツ レンケツ 連結 名詞-普通名詞-サ変可能 0
会計 カイケー カイケイ 会計 名詞-普通名詞-サ変可能 0
期間 キカン キカン 期間 名詞-普通名詞-一般 1,2
末日 マツジツ マツジツ 末日 名詞-普通名詞-一般 0
満期 マンキ マンキ 満期 名詞-普通名詞-一般 0,1
手形 テガタ テガタ 手形 名詞-普通名詞-一般 0
EOS
However, unidic_lite (or unidic) is based upon 国語研短単位 (NINJAL Short Unit Words), which is a rather short word unit for this purpose. I think that some longer unit, such as 国語研長単位 (NINJAL Long Unit Words), would be more suitable for FinTech. Would you try making your own tokenizer?
As you mentioned, it seems that subword tokenization based on long units such as 長単位 would be better than using ipadic or unidic(_lite). I think it would be better to create such a tokenizer, but that is difficult with my current resources...
Hi @retarfi, I've just released Japanese-LUW-Tokenizer. It took me about 20 hours to build the tokenizer from a 700MB orig.txt (one UTF-8 sentence per line) on one GPU (NVIDIA GeForce RTX 2080):
import unicodedata
from tokenizers import CharBPETokenizer
from transformers import AutoModelForTokenClassification,AutoTokenizer,TokenClassificationPipeline,RemBertTokenizerFast

# Segment orig.txt into Long Unit Words (LUW) with bert-base-japanese-luw-upos
brt="KoichiYasuoka/bert-base-japanese-luw-upos"
mdl=AutoModelForTokenClassification.from_pretrained(brt)
tkz=AutoTokenizer.from_pretrained(brt)
nlp=TokenClassificationPipeline(model=mdl,tokenizer=tkz,aggregation_strategy="simple",device=0)
with open("orig.txt","r",encoding="utf-8") as f, open("luw.txt","w",encoding="utf-8") as w:
  d=[]
  for r in f:
    if r.strip()!="":
      d.append(r.strip())
      if len(d)>255:
        # write each sentence as space-separated LUWs, batch by batch
        for s in nlp(d):
          print(" ".join(t["word"] for t in s),file=w)
        d=[]
  if len(d)>0:
    for s in nlp(d):
      print(" ".join(t["word"] for t in s),file=w)

# Train a BPE tokenizer on the LUW-segmented corpus, keeping single CJK
# characters from the original vocabulary in the initial alphabet
alp=[c for c in tkz.convert_ids_to_tokens([i for i in range(len(tkz))]) if len(c)==1 and unicodedata.name(c).startswith("CJK")]
pst=tkz.backend_tokenizer.post_processor
tkz=CharBPETokenizer(lowercase=False,unk_token="[UNK]",suffix="")
tkz.normalizer.handle_chinese_chars=False
tkz.post_processor=pst
tkz.train(files=["luw.txt"],vocab_size=250300,min_frequency=2,limit_alphabet=20000,initial_alphabet=alp,special_tokens=["[PAD]","[UNK]","[CLS]","[SEP]","[MASK]","<special0>","<special1>","<special2>","<special3>","<special4>","<special5>","<special6>","<special7>","<special8>","<special9>"],suffix="")
tkz.save("tokenizer.json")

# Wrap the trained tokenizer for use with transformers and save it
tokenizer=RemBertTokenizerFast(tokenizer_file="tokenizer.json",vocab_file="/dev/null",bos_token="[CLS]",cls_token="[CLS]",unk_token="[UNK]",pad_token="[PAD]",mask_token="[MASK]",sep_token="[SEP]",do_lower_case=False,keep_accents=True)
tokenizer.save_pretrained("Japanese-LUW-Tokenizer")
vocab_size=250300 seems too big but acceptable. See details in my diary (written in Japanese).
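A minimal loading sketch (assuming the tokenizer is published on the Hugging Face Hub as KoichiYasuoka/Japanese-LUW-Tokenizer):
from transformers import AutoTokenizer
# load the released LUW-based tokenizer and try the problematic strings
tokenizer=AutoTokenizer.from_pretrained("KoichiYasuoka/Japanese-LUW-Tokenizer")
print(tokenizer.tokenize("四半期連結会計期間末日満期手形"))
print(tokenizer.tokenize("第3四半期連結会計期間末日満期手形"))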
Thank you for sharing! I will check it in detail.
Thank you for releasing bert-small-japanese-fin and the other ELECTRA models for FinTech. But I've found that they tokenize "四半期連結会計期間末日満期手形" in a bad way.
This is because of a bug in ipadic's 名詞,数 (noun-number) tokenization for kanji strings that begin with a kanji numeral (漢数字). I recommend using a tokenizer other than BertJapaneseTokenizer+ipadic. See details in my diary (written in Japanese).
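For reproduction, here is a minimal sketch of the underlying ipadic behaviour using fugashi with the ipadic PyPI package (output not shown here; per the bug described above, the whole string is expected to come back as a single 名詞,数 token instead of being split):
import fugashi,ipadic
# GenericTagger with ipadic's MeCab arguments; with this dictionary the
# 漢数字-initial string is expected to be kept as one long 名詞,数 token
parser=fugashi.GenericTagger(ipadic.MECAB_ARGS).parse
print(parser("四半期連結会計期間末日満期手形"))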