Adds a phonetic reading to the Japanese parser

HugoFara commented 1 year ago

Hi Jeff!

This PR introduces automatic completion of the phonetic reading field for Japanese (in katakana), using MeCab. I have used it extensively, as it makes learning Japanese much easier, and I have seen no issues so far.

When a user clicks a Japanese term, the server loads the term form with the "romanization" field already filled in. It is a minimal change, but it's a big deal for Japanese learners.

Without this system, you usually have to click a word (for instance 食べる), go to https://jisho.org, copy and paste the expression with its furigana (which pastes as た食べる), then remove the kanji and join what remains to get your final たべる.
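As a rough illustration of the automated step, here is a minimal Python sketch (not the PR's PHP code) that pulls the katakana reading out of one line of MeCab output. It assumes the default IPAdic feature layout, where the reading is the eighth comma-separated feature; other dictionaries such as UniDic use a different layout. The function name `reading_from_mecab_line` and the sample line are hypothetical.

```python
from typing import Optional

# Assumption: MeCab's default output with the IPAdic dictionary, one token per line:
#   surface<TAB>pos,pos1,pos2,pos3,conj_type,conj_form,base_form,reading,pronunciation
READING_INDEX = 7


def reading_from_mecab_line(line: str) -> Optional[str]:
    """Return the katakana reading from one MeCab output line, or None if absent."""
    line = line.rstrip("\n")
    # "EOS" marks the end of a sentence and carries no features.
    if line == "EOS" or "\t" not in line:
        return None
    _surface, features = line.split("\t", 1)
    # Naive split: does not handle features that themselves contain commas.
    fields = features.split(",")
    # Unknown words typically have fewer features, so no reading is available.
    if len(fields) <= READING_INDEX:
        return None
    reading = fields[READING_INDEX]
    return None if reading == "*" else reading


# Hypothetical sample line for 食べる in the assumed IPAdic layout.
sample = "食べる\t動詞,自立,*,*,一段,基本形,食べる,タベル,タベル"
print(reading_from_mecab_line(sample))
```

In a real setup the line would come from the MeCab binary or a binding rather than a hard-coded string.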

I hope I've convinced you!

jzohrab commented 1 year ago

Hi @HugoFara - this would be v useful for Japanese learners, great idea. Cheers!

There are a few architectural things that I think should be resolved before merging. I'll add comments to the code.

jzohrab commented 1 year ago

Hi @HugoFara, I pulled your branch into this repo and rebased it further, then tweaked it to get the existing tests to pass, added some other quick tests, and refactored. The branch in this repo is feature/japan-phonetic.

Check out the tests in tests/src/Domain/JapaneseParser_Test.php, make sure that's the behaviour you want. :-)

You can reset your branch to this repo's branch and continue working on it if you'd like:

git remote add upstream git@github.com:jzohrab/lute.git
git fetch upstream
git reset --hard upstream/feature/japan-phonetic

HugoFara commented 1 year ago

Hi Jeff! I've pulled and tested your changes, both by hand and with the unit tests. I'm satisfied with the way things work and saw no issues (on Ubuntu). It's all green lights for me, so I'll leave the rest to you!

P.S.: as a side note, phonetic help isn't relevant in most cases when the word is in kana (hiragana/katakana) or already in roumaji. But since such words add very little overhead to the database, I don't feel it's a priority.

jzohrab commented 1 year ago

Some final questions in Discord before this gets merged in (or rather, before the rebased branch gets pulled in :-) ).

jzohrab commented 1 year ago

Notes from Discord chat for posterity :-)

I wonder if the readings should be in hiragana, instead of katakana.  Most of the books and newspapers I'd read in JP had readings in hiragana.  Any reason for choosing katakana?
Yeah re words already being in hiragana or roumaji -- I can amend the branch to not put any reading if the result is the same as the initial word (e.g., "NHK" would be blank, so would "dochira" (in hiragana)).  NP for that.
I think my pref is for hiragana readings.  Objections? 😛
jz — Yesterday at 9:52 PM
(If you do object, katakana it is 🙂 )

hugofara — Yesterday at 9:54 PM
Well, hiragana and katakana are two of a kind, but traditionally in Japanese the reading (yomikata 読み方) is always in katakana
I'm not sure why that is, maybe it avoids confusion with words (usually written in hiragana)

jz — Yesterday at 10:26 PM
Oh I think I’m thinking of furigana
Which are usually hiragana IIRC. I don’t know why but I always found hiragana easier to read than katakana.

hugofara — Yesterday at 11:22 PM
Hiragana are more common than katakana, as you use them in most situations. Katakana are mainly used for animal names, foreign words and ... word readings
Honestly, it's just another character set, so it's not harder than hiragana once you get used to it. And if we use it for readings in LUTE, users will be highly exposed to katakana so after a few days it won't feel any different

jzohrab commented 1 year ago

I've merged in the code in the updated branch in this repo. Closing this PR. It will get launched with 2.0.1.

Looks great!

And with hira-only (eg "みんな") it doesn't bother giving a reading.
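The "no reading when it adds nothing" rule can be sketched as follows. This is an illustrative Python version, not the merged PHP code: it assumes the parser returns katakana readings, and that for terms such as "NHK" the returned reading is just the term itself. The names `hira_to_kata` and `reading_or_blank` are hypothetical.

```python
# Hiragana ぁ (U+3041) .. ゖ (U+3096) map onto katakana ァ (U+30A1) .. ヶ (U+30F6)
# by adding a fixed offset of 0x60 to each code point.
HIRAGANA_START = 0x3041
HIRAGANA_END = 0x3096
KANA_OFFSET = 0x60


def hira_to_kata(text: str) -> str:
    """Convert hiragana to katakana; all other characters pass through unchanged."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if HIRAGANA_START <= ord(ch) <= HIRAGANA_END else ch
        for ch in text
    )


def reading_or_blank(term: str, reading: str) -> str:
    """Return the reading, or "" when it only repeats the term in another kana script."""
    if hira_to_kata(term) == hira_to_kata(reading):
        return ""
    return reading


print(repr(reading_or_blank("みんな", "ミンナ")))  # kana-only term
print(repr(reading_or_blank("食べる", "タベル")))  # term containing kanji
```

Comparing both strings after normalising to katakana means a hiragana-only or katakana-only term is left blank, while any term containing kanji keeps its reading.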

jzohrab / lute

Adds a phonetic reading to the Japanese parser #22