Incorrect reading of Kanji in the context of certain words

miurahr / pykakasi

Lightweight converter from Japanese Kana-kanji sentences into Kana-Roman.

GNU General Public License v3.0

421 stars, 54 forks

Describe the bug: The Romanization (the 'hepburn' output) and the Kana readings of certain Kanji characters are wrong in the context of particular words. For example, 上 is read うえ [ue] on its own but じょう [jou] in some compounds, as the example below shows.

Related issue: None

To Reproduce: Run the following code with Python 3. The example shows the problem for the Romanization (hepburn) only, but the same problem can be reproduced for Hiragana by changing item['hepburn'] to item['hira'].

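from pykakasi import kakasi

text = "知床観光船作業船上に引き揚げ 網走港向かう あす以降陸揚げ"
kks = kakasi()  # use a distinct name so the kakasi class is not shadowed
result = kks.convert(text)

# Print the Hepburn romanization of each segment, skipping bare spaces.
for item in result:
    if item['orig'] != ' ':
        print(item['hepburn'], end=' ')
print()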

Expected output: shiretoko kankousen sagyousenjou ni hikiage abashirikou mukau asu ikou rikuage

Actual output: shiretoko kankousen sagyousen ueni hiki age abashiri minato muka u asu ikou rikuage
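To see where the two romanizations diverge, the expected and actual strings above can be compared token by token. The following is a minimal sketch using only the standard library; the helper name `mismatched_spans` is made up for this illustration and is not part of pykakasi.

```python
import difflib

# Strings copied from the "Expected output" and "Actual output" lines above.
expected = "shiretoko kankousen sagyousenjou ni hikiage abashirikou mukau asu ikou rikuage".split()
actual = "shiretoko kankousen sagyousen ueni hiki age abashiri minato muka u asu ikou rikuage".split()


def mismatched_spans(expected_tokens, actual_tokens):
    """Return (expected, actual) pairs for every run of tokens that differs."""
    matcher = difflib.SequenceMatcher(a=expected_tokens, b=actual_tokens, autojunk=False)
    return [
        (" ".join(expected_tokens[i1:i2]), " ".join(actual_tokens[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


for want, got in mismatched_spans(expected, actual):
    print(f"expected: {want}")
    print(f"actual:   {got}")
```

The leading and trailing tokens agree; the differing run in the middle covers the segmentation of 作業船上に, 引き揚げ, 網走港 and 向かう.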

Environment:

OS: Linux (5.10.104-linuxkit) on a Docker container.
Host OS: macOS 12.4
Python 3.9.7
pykakasi version: [v2.2.1, commit #6d1276469f3c70c58dc5c5a0be0c3899adbcaf83 on master, installed through pip]

Test data: Check the code above. Try the Kanji sentence: 知床観光船作業船上に引き揚げ網走港向かうあす以降陸揚げ.

Additional context: None.

kakasidict.utf8:さぎょうせん作業船
kakasidict.utf8:あばしり網走
kakasidict.utf8:あばしりえき網走駅
kakasidict.utf8:むi 向
kakasidict.utf8:むk 向
kakasidict.utf8:むかi 向
kakasidict.utf8:むかt 向
kakasidict.utf8:むかい向
kakasidict.utf8:むく向
kakasidict.utf8:むこu 向
kakasidict.utf8:むかいかぜ向い風
kakasidict.utf8:むかい向かい
kakasidict.utf8:みなと港
kakasidict.utf8:こう港
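The dictionary entries above suggest why 網走港 comes out as "abashiri minato": there is an entry for 網走 and entries for 港 alone, but none for the compound. The toy sketch below illustrates this under two assumptions that are not confirmed by the issue: that lookup is a greedy longest match, and that the first listed reading wins when a Kanji has several. `TOY_DICT` and `longest_match_readings` are hypothetical names for this illustration and do not reflect pykakasi's actual internals.

```python
# Toy dictionary: surface form -> candidate readings, taken from the grep output above.
TOY_DICT = {
    "作業船": ["さぎょうせん"],
    "網走": ["あばしり"],
    "網走駅": ["あばしりえき"],
    "港": ["みなと", "こう"],
}


def longest_match_readings(text, dictionary):
    """Greedy longest-match lookup; picks the first listed reading (an assumption)."""
    readings = []
    pos = 0
    max_len = max(len(key) for key in dictionary)
    while pos < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            chunk = text[pos:pos + length]
            if chunk in dictionary:
                readings.append(dictionary[chunk][0])
                pos += length
                break
        else:
            # No entry covers this character; pass it through unchanged.
            readings.append(text[pos])
            pos += 1
    return readings


print(longest_match_readings("網走港", TOY_DICT))

# Hypothetical extra entry for the compound; it is not present in the grep output.
patched = {**TOY_DICT, "網走港": ["あばしりこう"]}
print(longest_match_readings("網走港", patched))
```

In this toy model, the compound is split into 網走 + 港 until a dedicated entry for 網走港 exists, at which point the longer match takes precedence. Whether adding such an entry is the right fix for the real dictionary is a separate question.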

Incorrect reading of Kanji in the context of certain words #153