polm / cutlet

Japanese to romaji converter in Python
https://polm.github.io/cutlet/
MIT License
286 stars, 20 forks

Inconsistent romaji conversion of しょ #62

Closed: fr3nd closed this issue 3 weeks ago

fr3nd commented 3 weeks ago

Hi,

First off, big thanks for creating Cutlet. I think it's really great and really helpful for my project!

I think I've found some strange behavior in the library, and I'm not sure whether it's a bug. When converting "しょうり" to Hepburn romaji, the result is "Shiyoori", which doesn't seem right to me. However, other conversions with "しょ" seem to work properly.

Check out this example:

>>> import cutlet
>>> katsu = cutlet.Cutlet()
>>> katsu.use_foreign_spelling = False
>>> katsu.romaji("しょうり")
'Shiyoori'
>>> katsu.romaji("しょ")
'Sho'
>>> katsu.romaji("しょうゆ")
'Shouyu'

Is this behavior correct? Am I missing something?

Cheers

polm commented 3 weeks ago

Glad to hear cutlet's been useful to you.

Basically you have a weird case where the underlying Japanese tokenizer is failing to work as you would expect. The issue is not with しょ specifically and isn't related to small kana.

When debugging, it's often helpful to run things through fugashi to see what's happening. With my local setup I get this:

しょう  シヨー  スル    為る    動詞-非自立可能 サ行変格        意志推量形
り      リ      リ      り      助動詞  文語助動詞-リ   終止形-一般
EOS

Here we can see we're getting two tokens, the first of which is しょう. But it's being interpreted as a form of suru - it's specifically being interpreted as a variation on shiyou. And for some reason the canonical form is not shiyou, but シヨー/shiyoo - not really sure why that would be, but sometimes slangy words are like that phonetically in UniDic. (Compare やろうぜ, which becomes Yarou ze.)
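A token dump like the one above can be approximated from Python. The following is only a sketch: it assumes fugashi and a UniDic dictionary (for example unidic-lite) are installed, and the helper names `describe` and `dump_tokens` are made up for illustration. The columns will not match the raw MeCab output exactly, since only the surface, pronunciation, lemma, and part of speech are printed.

```python
def describe(tokens):
    """Format tokens as tab-separated rows: surface, pronunciation, lemma, POS."""
    rows = []
    for tok in tokens:
        feat = tok.feature
        # str() guards against fields that are None for unknown words.
        rows.append("\t".join([tok.surface, str(feat.pron), str(feat.lemma), tok.pos]))
    return "\n".join(rows)


def dump_tokens(text):
    """Tokenize text with fugashi and return the formatted rows."""
    # Imported lazily: needs fugashi plus a UniDic dictionary to be installed.
    from fugashi import Tagger
    return describe(Tagger()(text))
```

Calling `print(dump_tokens("しょうり"))` should then show how the tokenizer split the input. The exact split depends on the dictionary version in use.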

It seems very hard to get the dictionary to parse しょうり as a single noun instead of a variety of weird interpretations, but on the other hand, unlike しょうゆ, 勝利 would not be written only in hiragana in a normal document.

Are you working with documents written primarily in hiragana? If you just need a straightforward kana to romaji conversion you could be better off using map_kana like so:

>>> from cutlet import Cutlet
>>> katsu = Cutlet()
>>> katsu.map_kana("しょうり")
'shouri'
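
To illustrate why a direct kana mapping sidesteps the tokenizer problem, here is a minimal sketch of character-level conversion. It is not cutlet's implementation: the tables and the `kana_to_romaji` name are hypothetical, and they cover only the handful of kana used in this thread. The point is that each kana (or kana plus small ゃ/ゅ/ょ) is looked up on its own, so no word segmentation is involved.

```python
# Tiny illustrative tables; a real converter needs the full kana set.
DIGRAPHS = {"しゃ": "sha", "しゅ": "shu", "しょ": "sho", "きょ": "kyo", "りょ": "ryo"}
MONOGRAPHS = {
    "し": "shi", "よ": "yo", "う": "u", "り": "ri",
    "ゆ": "yu", "や": "ya", "ろ": "ro", "ぜ": "ze",
}


def kana_to_romaji(text):
    """Convert hiragana to romaji one kana (or kana + small kana) at a time."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in DIGRAPHS:
            # Consonant + small ya/yu/yo is handled as a single unit.
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            # Characters missing from the table pass through unchanged.
            out.append(MONOGRAPHS.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(kana_to_romaji("しょうり"))  # shouri
print(kana_to_romaji("しょうゆ"))  # shouyu
```

Because nothing is tokenized, しょうり can never be misread as a verb form plus an auxiliary. The trade-off is that word boundaries and capitalization are lost.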

fr3nd commented 3 weeks ago

(edited because I misunderstood the solution)

Hi, and thanks for the quick response.

I need to convert directly from hiragana because in some cases I need to get all the possible readings from a kanji form, so I think that this solution will work for me. I wasn't aware of the map_kana function.

Thank you!
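
For the use case described above (romanizing every candidate reading of a kanji form), one possible approach is to pass each kana reading through a converter such as `map_kana`. This is a sketch only: `romanize_readings` is a hypothetical helper, and the converter is passed in as a parameter so that the function itself does not depend on cutlet.

```python
def romanize_readings(readings, convert):
    """Map each distinct kana reading to its romaji, keeping first-seen order."""
    # dict.fromkeys drops duplicate readings while preserving order.
    return {reading: convert(reading) for reading in dict.fromkeys(readings)}
```

Assuming cutlet is installed, this could be called as `romanize_readings(["せい", "しょう", "なま"], cutlet.Cutlet().map_kana)`, where the list holds some readings of 生 as example input.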