nicolas-raoul / jakaroma

Java library and command-line tool to transliterate Japanese kanji to romaji (Latin alphabet)
Apache License 2.0

mecab-ipadic does not contain 機能 nor 作用 #14

Open LanaTimko opened 2 years ago

LanaTimko commented 2 years ago

Hello @nicolas-raoul,

We use your library in our product to transliterate kanji to romaji. In some cases the results are not correct:

Would it be possible to fix the library so that it transliterates kanji to romaji more accurately?

Thanks in advance, Lana

nicolas-raoul commented 2 years ago

Hello Lana,

Wow, that's a bug. Thanks for letting us know, and please report any similar problems you find.

nicolas-raoul commented 2 years ago

Interestingly, when I just tried, these words were left as-is:

$ ./jakaroma.sh 機能仕様書
機能Shiyo- Sho
$ ./jakaroma.sh 機能
機能
$ ./jakaroma.sh 作用
作用

which is not great either, but arguably better than outputting mistaken romaji.

Are you using the code found in the GitHub master branch? Or did you modify it somehow? For instance, did you switch to another dictionary?

LanaTimko commented 2 years ago

Hello Nicolas,

Yes, we use the Maven version of your library from master and we didn't change anything, so we use your standard dictionary.

LanaTimko commented 2 years ago

As I wrote earlier, we use your standard dictionary and the latest master version without any additional changes. However, our logic transliterates kanji as Chinese by default. A special property changes this behavior for Japanese customers, and in that case your library is applied. If the library cannot transliterate a word (as in these examples), our default behavior takes over. That is why you saw the words left as-is while we got a Chinese transliteration.

It would help us a lot if you could fix this issue so that these words (機能, 作用) are transliterated to romaji correctly. For now we are going to apply a workaround in our product, and we look forward to your fix so that we can implement the proper behavior.

Thanks in advance! Lana
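
The fallback described above can be sketched roughly as follows. This is only an illustration, not the actual product code: `KanjiFallback`, `containsHan`, and the two `UnaryOperator<String>` stand-ins are hypothetical names, and no real jakaroma or Chinese-transliteration API is called. The sketch assumes that any Han character left in the primary output means the primary transliterator gave up on at least part of the input.

```java
import java.util.function.UnaryOperator;

class KanjiFallback {

    /** True if any code point in the text belongs to the Han script (kanji or hanzi). */
    static boolean containsHan(String text) {
        return text.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    /**
     * Applies the primary transliterator first. If its output still contains
     * Han characters, the original input is handed to the fallback instead.
     */
    static String transliterate(String input,
                                UnaryOperator<String> primary,
                                UnaryOperator<String> fallback) {
        String result = primary.apply(input);
        return containsHan(result) ? fallback.apply(input) : result;
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins: one that leaves its input untouched,
        // one that only marks the text as having gone through the fallback.
        UnaryOperator<String> leavesInputAsIs = s -> s;
        UnaryOperator<String> markedFallback = s -> "[fallback] " + s;
        for (String arg : args) {
            System.out.println(transliterate(arg, leavesInputAsIs, markedFallback));
        }
    }
}
```

Checking for leftover Han characters also catches partial results such as `機能Shiyo- Sho`, where only part of the input was converted.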

nicolas-raoul commented 2 years ago

I just downloaded the dictionary http://atilika.com/releases/mecab-ipadic/mecab-ipadic-2.7.0-20070801.tar.gz (EUC-JP) and found that Noun.csv contains 仕様 but contains neither 機能 nor 作用 as a single noun. That is probably the cause of the problem.
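
For anyone who wants to repeat this check, here is a minimal sketch. It assumes the tarball has already been extracted, and it relies on the mecab-ipadic CSV layout, where the first comma-separated field of each row is the surface form. The class name `IpadicSurfaceCheck` and its methods are made up for illustration and are not part of jakaroma.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class IpadicSurfaceCheck {

    /**
     * Returns the words that never occur as the first CSV field (the surface form)
     * of any line. The input set is copied, not modified.
     */
    static Set<String> missingSurfaces(Iterable<String> csvLines, Set<String> words) {
        Set<String> missing = new LinkedHashSet<>(words);
        for (String line : csvLines) {
            int comma = line.indexOf(',');
            String surface = comma < 0 ? line : line.substring(0, comma);
            missing.remove(surface);
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: IpadicSurfaceCheck <Noun.csv> <word> [<word> ...]");
            return;
        }
        // The 2.7.0-20070801 dictionary files are encoded in EUC-JP, not UTF-8.
        List<String> lines = Files.readAllLines(Paths.get(args[0]), Charset.forName("EUC-JP"));
        Set<String> words = new LinkedHashSet<>(Arrays.asList(args).subList(1, args.length));
        System.out.println("Not found as a surface form: " + missingSurfaces(lines, words));
    }
}
```

With Java 11 or later this can be run directly from source, for example `java IpadicSurfaceCheck.java path/to/Noun.csv 仕様 機能 作用`, where the path is a placeholder for wherever the dictionary was extracted.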

Unfortunately, I am currently working on other projects, but could you please try to find an updated version of that dictionary, or work out the process for adding new words to it? Please post your findings here. Thanks a lot!