miurahr / pykakasi

Lightweight converter from Japanese Kana-kanji sentences into Kana-Roman.
https://codeberg.org/miurahr/pykakasi
GNU General Public License v3.0
421 stars 54 forks

276 "kanji" are not converted if the input text has same/similar looking hanzi mixed in, also the converter does not complain. #119

Closed. BarnabasSzabolcs closed this issue 3 years ago

BarnabasSzabolcs commented 3 years ago

Describe the bug: First of all, thanks for this project! Without your work I probably could not do my project at all.

The issue: Some Japanese writers probably typed, by accident, the Chinese variant of a kanji that looks the same, or mixed in simplified Chinese characters, or some CJK unification conversion happened. The key point is that 見 and 見 are not the same in Unicode. In the full list I've sent, some characters are clearly the simplified Chinese versions of kanji; however, I think these characters should be converted just the same.
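The point about look-alike characters can be checked with Python's standard `unicodedata` module. The sketch below is only an illustration and is not part of pykakasi: the helper names `describe` and `fold_compat` are hypothetical, and the three sample characters (a Kangxi radical, the ordinary kanji, and the simplified Chinese form) were chosen as assumptions about what kind of characters are in the list.

```python
import unicodedata


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


def fold_compat(text):
    """Apply NFKC normalization, which replaces compatibility characters
    (such as Kangxi radicals) with their ordinary equivalents."""
    return unicodedata.normalize("NFKC", text)


# Escapes are used so the look-alike characters cannot be mixed up in an editor.
kangxi_see = "\u2f92"      # KANGXI RADICAL SEE, looks like the kanji
kanji_see = "\u898b"       # the ordinary kanji
simplified_see = "\u89c1"  # the simplified Chinese form

for ch in (kangxi_see, kanji_see, simplified_see):
    print(describe(ch), "->", describe(fold_compat(ch)))
```

NFKC folds the Kangxi radical into the ordinary kanji, but it leaves the simplified Chinese character untouched, because that is a separate unified ideograph rather than a compatibility variant. NFKC also rewrites other compatibility characters (for example full-width Latin letters), so it may not be appropriate for every input.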


To Reproduce: I use the following code to convert Japanese text to romaji:

import pykakasi

kakasi = pykakasi.kakasi()
kakasi.setMode("H", "a")  # Hiragana to ascii, default: no conversion
kakasi.setMode("K", "a")  # Katakana to ascii, default: no conversion
kakasi.setMode("J", "a")  # Japanese to ascii, default: no conversion
kakasi.setMode("r", "Hepburn")  # default: use Hepburn Roman table
kakasi.setMode("s", True)  # add space, default: no separator
kakasi.setMode("C", False)  # no capitalization
# Input that reproduces the problem:
text = "⽢⾃々〻ゞ业东丝丢两丨为丽么乐习书产亿们众优伙伟传伤你侧俱值內兰关兴兹军冻击别剧办动劳卖卡卢厉厌发变吗吧启呃员呜呢响哎哟唸啊啦喂喔嗎嗫嗯团场增处备头夺奶她妆妈妳姬实对寻尔带应废开张强怀态总恶战戾护报拋拥择损捥搔搞敌斩时晚暧极查标样欢步歲每污沟淚渴溫满灵热爱爸爹狱玛环现盘矿确离种竞笔类紧緖红约级纯纸线细终绊经结给绝续绮缔缚缠缲罗职胜脋脫舰艳蓝蔷薰虽蟬补见觉說计认让议记许讹诂诉词诛话该详语说诵请诺谁谈谍谛谢负贯贵贷费贽赶跃踠踩踬轨轮轻辉边达过运还这进远连选遗銮錬针钱铛银错镜镮长门闭问间闻阁队阳险隐难预领颗颜风飞饭马驶驾验骗骷髅髙鲜鸟鸠黑金北葉立切行見"
# The assertion passes: apart from added spaces, the output equals the input, so nothing was converted.
assert kakasi.getConverter().do(text).replace(' ', '') == text

Expected behavior: The characters are converted to Latin letters.
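Since the converter passes such characters through without complaint, a caller could check the output for leftover ideographs. This is a minimal sketch under stated assumptions: `find_unconverted` is a hypothetical helper, not a pykakasi function, and the Unicode ranges it scans (CJK radicals, Kangxi radicals, Extension A, the main unified block, and compatibility ideographs) are an assumption about what counts as "unconverted".

```python
import re

# Assumed set of "kanji-like" blocks: CJK Radicals Supplement and Kangxi
# Radicals, CJK Extension A, CJK Unified Ideographs, CJK Compatibility Ideographs.
_CJK = re.compile("[\u2e80-\u2fdf\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]")


def find_unconverted(romanized):
    """Return the distinct ideographs still present in converter output,
    in order of first appearance."""
    return list(dict.fromkeys(_CJK.findall(romanized)))


# Example with a made-up output string that still contains two ideographs.
leftover = find_unconverted("kanji \u89c1 to \u2f92 \u89c1")
print([f"U+{ord(ch):04X}" for ch in leftover])
```

Running this on the converter's result for the test string would list the characters that were passed through unchanged, which makes the silent failure visible.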


miurahr commented 3 years ago

You cannot convert Chinese standard kanji with pykakasi.

BarnabasSzabolcs commented 3 years ago

I have found some info on how to fix this issue: https://stackoverflow.com/a/20843126/1031191

miurahr commented 3 years ago

You may want to use Unihandecode, which can handle Chinese: https://github.com/miurahr/unihandecode