Closed anirbanbasu closed 2 years ago
The dictionary that pykakasi uses may have no entries for 作業船上 and 網走港.
向かう is defined as むk in the dictionary, which matches 向か-う, 向こ-う, etc., so muka is recognized as the word stem.
kakasidict.utf8:さぎょうせん 作業船
kakasidict.utf8:あばしり 網走
kakasidict.utf8:あばしりえき 網走駅
kakasidict.utf8:むi 向
kakasidict.utf8:むk 向
kakasidict.utf8:むかi 向
kakasidict.utf8:むかt 向
kakasidict.utf8:むかい 向
kakasidict.utf8:むく 向
kakasidict.utf8:むこu 向
kakasidict.utf8:むかいかぜ 向い風
kakasidict.utf8:むかい 向かい
kakasidict.utf8:みなと 港
kakasidict.utf8:こう 港
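The entries with a trailing Latin letter in the listing above (such as むk 向) can be read as "this reading applies when the next kana starts with that consonant". The following is a minimal sketch of that idea, not pykakasi's actual code: the names `KANA_CONSONANT`, `ENTRIES`, and `read_stem` are hypothetical, and the kana table is deliberately partial.

```python
# Hypothetical, partial table: kana -> first consonant of its romanization.
KANA_CONSONANT = {
    "か": "k", "き": "k", "く": "k", "け": "k", "こ": "k",
}

# One toy entry modelled on the dictionary line "むk 向":
# (kanji, consonant of the following kana) -> reading of the kanji.
ENTRIES = {("向", "k"): "む"}


def read_stem(kanji, next_kana):
    """Return the kana reading of kanji + next_kana, or None if no entry applies."""
    consonant = KANA_CONSONANT.get(next_kana)
    reading = ENTRIES.get((kanji, consonant))
    if reading is None:
        return None
    # The following kana is kept as part of the matched stem.
    return reading + next_kana


stem_ka = read_stem("向", "か")   # 向か, as in 向かう
stem_ko = read_stem("向", "こ")   # 向こ, as in 向こう
stem_ma = read_stem("向", "ま")   # no matching entry in this toy table
print(stem_ka, stem_ko, stem_ma)
```

Under this reading of the entry, 向か is matched as one unit and the final う is left over as a separate token, which is consistent with the "muka u" in the reported output.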
For a single kanji, pykakasi gives priority to the native Japanese reading (kun'yomi), minato, over the Chinese-derived reading (on'yomi), kou.
If you need accurate sentence analysis in Japanese, you are better served by a modern NLP library than by pykakasi. PyKakasi is designed to be lightweight, simple, stupid, and low-footprint. It does not perform actual morphological analysis (形態素解析); it only matches vocabulary using a longest-match algorithm.
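To make the longest-match behaviour concrete, here is a minimal sketch of a greedy longest-match converter. It is not pykakasi's implementation: `TOY_DICT` and `longest_match` are hypothetical names, and the toy entries are illustrative rather than copied from the real dictionary files.

```python
# Toy dictionary: surface form -> romanized reading (illustrative only).
TOY_DICT = {
    "作業船": "sagyousen",
    "網走": "abashiri",
    "網走駅": "abashirieki",
    "港": "minato",
    "上": "ue",
    "に": "ni",
}


def longest_match(text, dictionary):
    """Scan left to right, always consuming the longest key found in the dictionary."""
    max_len = max(len(key) for key in dictionary)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in dictionary:
                out.append(dictionary[chunk])
                i += size
                break
        else:
            # No entry starts here: pass the character through unchanged.
            out.append(text[i])
            i += 1
    return out


result_ship = longest_match("作業船上に", TOY_DICT)
result_port = longest_match("網走港", TOY_DICT)
print(result_ship)  # ['sagyousen', 'ue', 'ni']
print(result_port)  # ['abashiri', 'minato']
```

Because there is no entry for 作業船上 or 網走港, the longest available matches are 作業船 and 網走, and the leftover single kanji 上 and 港 fall back to their standalone readings. A matcher of this kind has no way to use sentence context to choose jou or kou instead.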
Describe the bug: The Romanization (the 'hepburn' output) or the Kana readings of certain Kanji characters are wrong in the context of a particular word. For example, 上 is read うえ [ue] but can also be read じょう [jou], depending on the context, as the example below shows.
Related issue: None
To Reproduce: Run the following code with python3. The problem shown here is for Romanization (hepburn) only, but the problem with Hiragana can be reproduced by changing item['hepburn'] to item['hira'].
Expected output:
shiretoko kankousen sagyousenjou ni hikiage abashirikou mukau asu ikou rikuage
Actual output:
shiretoko kankousen sagyousen ueni hiki age abashiri minato muka u asu ikou rikuage
Environment:
pip
Test data: Check the code above. Try the Kanji sentence: 知床観光船作業船上に引き揚げ 網走港向かう あす以降陸揚げ.
Additional context: None.