miurahr / pykakasi

Lightweight converter from Japanese Kana-kanji sentences into Kana-Roman.
https://codeberg.org/miurahr/pykakasi
GNU General Public License v3.0

Incorrect reading of Kanji in the context of certain words #153

Closed: anirbanbasu closed this issue 2 years ago

anirbanbasu commented 2 years ago

Describe the bug: The romanization (the 'hepburn' output) and the kana readings of certain kanji are wrong in the context of particular words. For example, 上 can be read うえ [ue] or じょう [jou] depending on the context, as the example below shows.

Related issue: None

To Reproduce: Run the following code with python3. It shows the problem for the romanization (hepburn) only, but the same problem with hiragana can be reproduced by changing item['hepburn'] to item['hira'].

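from pykakasi import kakasi

text = "知床観光船作業船上に引き揚げ 網走港向かう あす以降陸揚げ"
kks = kakasi()  # use a separate name so the kakasi class is not shadowed
result = kks.convert(text)

# Print the Hepburn romanization of every token, skipping the space separators.
for item in result:
    if item['orig'] != ' ':
        print(item['hepburn'], end=' ')
print()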

Expected output: shiretoko kankousen sagyousenjou ni hikiage abashirikou mukau asu ikou rikuage

Actual output: shiretoko kankousen sagyousen ueni hiki age abashiri minato muka u asu ikou rikuage

Environment:

Test data: Check the code above. Try the Kanji sentence: 知床観光船作業船上に引き揚げ 網走港向かう あす以降陸揚げ.

Additional context: None.

miurahr commented 2 years ago

There is probably no entry for 作業船上 or 網走港 in the dictionary that pykakasi uses. 向かう is handled by the dictionary entry むk, which matches 向か-う, 向こ-う, etc., so muka is recognized as the word stem.

kakasidict.utf8:さぎょうせん 作業船

kakasidict.utf8:あばしり 網走
kakasidict.utf8:あばしりえき 網走駅

kakasidict.utf8:むi 向
kakasidict.utf8:むk 向
kakasidict.utf8:むかi 向
kakasidict.utf8:むかt 向
kakasidict.utf8:むかい 向
kakasidict.utf8:むく 向
kakasidict.utf8:むこu 向
kakasidict.utf8:むかいかぜ 向い風
kakasidict.utf8:むかい 向かい

kakasidict.utf8:みなと 港
kakasidict.utf8:こう 港

For a single kanji, pykakasi gives priority to the traditional Japanese reading, so it picks minato over the Chinese-derived reading kou.

If you need accurate sentence-level recognition of Japanese, I recommend looking at modern NLP libraries rather than pykakasi. PyKakasi is designed to be lightweight, simple, stupid, and low footprint. It does not perform actual modern morphological analysis (形態素解析); it just matches vocabulary with a longest-match algorithm.
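To illustrate why longest-match lookup produces the output reported above, here is a minimal sketch of a greedy longest-match segmenter. It is not pykakasi's actual implementation: the function name `longest_match` and the dictionaries `TOY_DICT` and `EXTENDED_DICT` are hypothetical, and the entries are a hand-picked subset modeled on the dictionary lines quoted in this thread.

```python
# Toy dictionary (an assumption for illustration, NOT pykakasi's real data).
# It has 作業船 and 網走 but, as in the thread, no 作業船上 or 網走港.
TOY_DICT = {
    "作業船": "sagyousen",
    "上": "ue",
    "に": "ni",
    "網走": "abashiri",
    "港": "minato",
}


def longest_match(text, dictionary):
    """Greedy left-to-right segmentation: at each position, take the
    longest dictionary key that matches; pass unknown characters through."""
    max_len = max(len(key) for key in dictionary)
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            chunk = text[pos:pos + size]
            if chunk in dictionary:
                tokens.append((chunk, dictionary[chunk]))
                pos += size
                break
        else:
            # No entry covers this character: emit it unchanged.
            tokens.append((text[pos], text[pos]))
            pos += 1
    return tokens


# Without a compound entry, 上 and 港 fall back to their single-kanji readings.
print(" ".join(reading for _, reading in longest_match("作業船上に", TOY_DICT)))
print(" ".join(reading for _, reading in longest_match("網走港", TOY_DICT)))

# Hypothetical extra entries: with them, the longer match wins.
EXTENDED_DICT = dict(TOY_DICT, **{"作業船上": "sagyousenjou", "網走港": "abashirikou"})
print(" ".join(reading for _, reading in longest_match("作業船上に", EXTENDED_DICT)))
print(" ".join(reading for _, reading in longest_match("網走港", EXTENDED_DICT)))
```

In this sketch the reading of 上 or 港 depends only on whether a longer dictionary key happens to cover it, not on any grammatical analysis of the sentence, which is the limitation described above.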