tshatrov / ichiran

Linguistic tools for texts in Japanese language
MIT License

Limits of hiragana-based romanisation #4

Open epipping opened 8 years ago

epipping commented 8 years ago

Hi,

(this doesn't really belong in a bug report but I'd still like to take a second to say that what you've done here is fabulous, amazing, and incredibly helpful. Thank you!).

I'm not sure I completely understand what goes on in romanize.lisp, but under certain circumstances it ends up merging an "o" and a "u" that it shouldn't. This issue is mentioned here, and 追う is given as an example. The correct reading of 追 is お, so in hiragana the word comes out as おう. This transformation is lossy/ambiguous, however: here, お and う are pronounced separately, in contrast to 王, which is also written おう in hiragana but is pronounced as a long お. Romanising 追う as ō is misleading, I think.

I believe that the general rule (and this might make for an easy fix) is: Merging of お and う cannot occur across kanji boundaries. In the presence of kanji, the breakup into hiragana and merging of お and う needs to occur before those tokens are thrown together.
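To make the proposed rule concrete, here is a minimal Python sketch. It is not ichiran's code: the function names are hypothetical, and it assumes the word has already been split into per-kanji segments and transliterated to basic romaji, which is exactly the information that would have to come from somewhere else. The merge of "ou" into "ō" is applied inside each segment only, so it can never cross a boundary.

```python
# Hypothetical sketch: merge "ou" into a long vowel only within a segment.
# Each segment is assumed to be the basic romaji of one kanji or one kana run.

def merge_long_o(segment: str) -> str:
    """Collapse 'ou' into 'ō' inside a single segment."""
    return segment.replace("ou", "ō")


def romanize_segments(segments: list[str]) -> str:
    """Merge within each segment first, then join the segments."""
    return "".join(merge_long_o(s) for s in segments)


# 追う: 追 = "o", okurigana う = "u" -> the boundary blocks the merge.
print(romanize_segments(["o", "u"]))
# 王: a single kanji read "ou" -> merged into a long vowel.
print(romanize_segments(["ou"]))
# 子牛: 子 = "ko", 牛 = "ushi" -> the boundary blocks the merge.
print(romanize_segments(["ko", "ushi"]))
```

With these inputs the sketch yields "ou" for 追う, "ō" for 王, and "koushi" for 子牛, because the merge runs before the segments are joined.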

Since I'm not a native speaker (quite the opposite), I checked forvo.com and found a recording that supports the claim that お and う are not joined in 追う: In the recording by the user strawberrybrown, the お and the う can be made out quite distinctly. In contrast, I found a few examples of もう, ぽう, ちょう, and どう that she pronounces as mō, pō, chō, and dō, respectively, just as expected. Which is to say, this user does not generally pronounce お and う sounds separately (as could be the case in a dialect, maybe?) but only when they're really meant to be separate.

There is another recording by the user smime, in the same place as linked to earlier, where the pronunciation of 追う is more difficult to make out, which is consistent with casual speech.

Finally, please see also wiktionary for romaji of 追う and .

Update: 子牛 is another example that showcases this problem. The romanisation is currently incorrectly given as kōshi.

tshatrov commented 8 years ago

Yeah, this sounds like a good idea. One problem though: in the JMdict database, hiragana readings are not separated by kanji, so this wasn't possible to implement at the time I wrote the romanization algorithm. More recently I have implemented a kanji module (kanji.lisp) that has a function match-readings, which could potentially be used to resolve this issue.

epipping commented 8 years ago

Oh. I didn't know that. I guess that makes what I had in mind quite difficult.

Special readings could be a problem. And then, what if (this is entirely fictional; I don't know if a real-world example exists) you have a word made up of two kanji, where the first can be read あ or あお and the second can be read う or おう? If you only know that the entire word reads あおう, then it could be split into あ-おう or あお-う...
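The ambiguity can be demonstrated with a small Python sketch that enumerates every way to split a whole-word reading across its kanji. Everything here is hypothetical: `split_readings` is an illustrative helper, not ichiran's match-readings, and the candidate readings are invented so that both splits exist (あ or あお for the first kanji, う or おう for the second).

```python
# Hypothetical sketch: enumerate all splits of a word reading over its kanji,
# given a list of candidate readings for each kanji in order.

def split_readings(reading: str, candidates: list[list[str]]) -> list[list[str]]:
    """Return every assignment of candidate readings that spells `reading`."""
    if not candidates:
        # All kanji consumed: valid only if nothing of the reading is left.
        return [[]] if reading == "" else []
    results = []
    for cand in candidates[0]:
        if reading.startswith(cand):
            # Try to fit the remaining kanji to the rest of the reading.
            for rest in split_readings(reading[len(cand):], candidates[1:]):
                results.append([cand] + rest)
    return results


splits = split_readings("あおう", [["あ", "あお"], ["う", "おう"]])
for s in splits:
    print("-".join(s))
```

This finds two splits, あ-おう and あお-う. More than one result means the kanji boundary, and therefore whether お and う may be merged, cannot be decided from the whole-word reading alone.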

tslater commented 2 years ago

I noticed that the traditional basic option on the site doesn't create the ō. @tshatrov, is there a way to change the romanization settings when using ichiran-cli? I'm specifically interested in doing that with the -f option.

tshatrov commented 2 years ago

@tslater I think if you do (setf ichiran:*default-romanization-method* ichiran:*hepburn-basic*) before building the executable, then -f will use basic romanization.

tslater commented 2 years ago

Looks like it is working. Thanks!