nicolas-raoul / jakaroma

Java library and command-line tool to transliterate Japanese kanji to romaji (Latin alphabet)
Apache License 2.0

一匹 is translated to "一Hiki" #7

Open ScoreUnder opened 6 years ago

ScoreUnder commented 6 years ago

As the title says.

I've been looking for a library to break kanji down into their readings (preferably hiragana), and my first test for each candidate is to see how it fares with the 〜匹 counters.

For reference, this is the expected output for the first 10:

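| 漢字 | ひらがな | Ro-maji |
|------|----------|---------|
| 一匹 | いっぴき | Ippiki |
| 二匹 | にひき | Nihiki |
| 三匹 | さんびき | Sanbiki |
| 四匹 | よんひき | Yonhiki |
| 五匹 | ごひき | Gohiki |
| 六匹 | ろっぴき | Roppiki |
| 七匹 | ななひき | Nanahiki |
| 八匹 | はっぴき | Happiki |
| 九匹 | きゅうひき | Kyuuhiki |
| 十匹 | じゅっぴき | Juppiki |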

However, this is the output the program creates:

% ./jakaroma.sh '一匹 二匹 三匹 四匹 五匹 六匹 七匹 八匹 九匹 十匹 1匹 2匹 3匹 4匹 5匹 6匹 7匹 8匹 9匹 10匹'
一Hiki  二Hiki  三Hiki  四Hiki  五Hiki  六Hiki  七Hiki  八Hiki  九Hiki  十Hiki  1Hiki  2Hiki  3Hiki  4Hiki  5Hiki  6Hiki  7Hiki  8Hiki  9Hiki  10Hiki 
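
For comparison, the expected readings listed earlier can be written down as a small lookup table. This is only a sketch to make the expected behaviour concrete: the class `HikiCounter` and its methods are hypothetical names, not part of jakaroma or Kuromoji, and the table covers only the ten kanji forms from this report.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of jakaroma: a hard-coded table of the
// expected readings for 一匹 through 十匹, copied from the list in this report.
public class HikiCounter {
    // Maps each counter word to {hiragana, romaji}; insertion order is kept.
    private static final Map<String, String[]> READINGS = new LinkedHashMap<>();

    static {
        READINGS.put("一匹", new String[] {"いっぴき", "Ippiki"});
        READINGS.put("二匹", new String[] {"にひき", "Nihiki"});
        READINGS.put("三匹", new String[] {"さんびき", "Sanbiki"});
        READINGS.put("四匹", new String[] {"よんひき", "Yonhiki"});
        READINGS.put("五匹", new String[] {"ごひき", "Gohiki"});
        READINGS.put("六匹", new String[] {"ろっぴき", "Roppiki"});
        READINGS.put("七匹", new String[] {"ななひき", "Nanahiki"});
        READINGS.put("八匹", new String[] {"はっぴき", "Happiki"});
        READINGS.put("九匹", new String[] {"きゅうひき", "Kyuuhiki"});
        READINGS.put("十匹", new String[] {"じゅっぴき", "Juppiki"});
    }

    /** Returns the hiragana reading, or null if the word is not in the table. */
    public static String hiragana(String word) {
        String[] reading = READINGS.get(word);
        return reading == null ? null : reading[0];
    }

    /** Returns the romaji reading, or null if the word is not in the table. */
    public static String romaji(String word) {
        String[] reading = READINGS.get(word);
        return reading == null ? null : reading[1];
    }

    public static void main(String[] args) {
        // Print every entry as "kanji hiragana romaji".
        for (String word : READINGS.keySet()) {
            System.out.println(word + " " + hiragana(word) + " " + romaji(word));
        }
    }
}
```

The digit forms from the test command (1匹, 2匹, and so on) are not in this table, so the lookup returns null for them; handling those would need a separate normalization step.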

nicolas-raoul commented 6 years ago

Thanks for the detailed feedback! Do you know whether the same problem appears in Kuromoji?

On the other hand, I can imagine cases where someone would prefer 一丁目 to be transliterated as 1chome rather than Icchome.

Another problem is that the program outputs kanji (such as 五Hiki in your example). I am not sure why, but that is indeed a big problem.

Cheers!

nicolas-raoul commented 2 years ago

Related: https://github.com/atilika/kuromoji/issues/125

Apparently switching to UniDic (which might be as simple as modifying pom.xml) would solve that particular case, but it might have lower performance in other areas.