himselfv / wakan

Japanese and Chinese learning tool with dictionary
Kanji lookup by romaji ON fails on long vowels #283

Open himselfv opened 9 years ago

himselfv commented 9 years ago

Original report by me.

Kanji search gives no results when filtering by On readings with long vowels, written in romaji, such as "chuu" (チュウ) or "juu" (ジュウ). These are fairly common so there should be lots of results.

The reason is that the romaji-to-katakana conversion turns long vowels (uu) into prolonged vowels (ウ-), since this is what you usually expect (katakana words mainly use prolonged vowels).

Yet ONs in the Kanjidic never use ュ- and always use the ュウ form.

Workaround 1: type katakana with OS keyboard.

Workaround 2: type チュ then type ウ in the editor, then filter by that.

Possible solution 1: when filtering kanji, set a flag, produce both versions of the katakana, and try each.

Possible solution 2: maybe chuu and juu are special cases and should always be converted to full form even in katakana? This should be studied (and also how other transliteration systems behave).
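A minimal sketch of solution 1, in Python rather than Wakan's Pascal, just to make the idea concrete. It assumes the converter has already produced katakana with the prolongation mark (written "-" in this thread, ー in the code). `reading_variants` and `U_ROW` are hypothetical names, not anything from Wakan or KanaConv, and only the "uu" case from this report is handled.

```python
CHOUONPU = "ー"  # katakana prolongation mark, U+30FC

# Katakana whose vowel is "u"; only these are expanded in this sketch.
U_ROW = set("ウクスツヌフムユルグズヅブプュ")


def reading_variants(katakana):
    """Return the reading as typed plus, if different, the form with
    each prolongation mark after a u-vowel kana written out as ウ."""
    expanded = []
    for i, ch in enumerate(katakana):
        if ch == CHOUONPU and i > 0 and katakana[i - 1] in U_ROW:
            expanded.append("ウ")
        else:
            expanded.append(ch)
    expanded = "".join(expanded)

    variants = [katakana]
    if expanded != katakana:
        variants.append(expanded)
    return variants


# Both forms would then be tried against the ON readings.
print(reading_variants("チュー"))
```

The filter would accept a kanji if any variant matches, which is where the possible false positives come from.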

WaiYanMyintMo commented 3 years ago

Solution 1 is the best IMO. There might be false positives, but it's better than missing searches. Besides, with the workaround the user would have to try two versions anyway, so why not do it in software?

WaiYanMyintMo commented 3 years ago

Is there any rough documentation on how the code base is structured? I'll see what I can do, and it'll be a good opportunity for me to learn Pascal as well. Consider it a thank you for your hard work on this project.

KarolS commented 3 years ago

@Stiles-X I'm pretty sure the code that handles that is at https://github.com/himselfv/wakan/blob/master/Modules/KanjiList/JWBKanjiList.pas#L821

himselfv commented 3 years ago

@Stiles-X Most of the dictionary lookup code is in https://github.com/himselfv/wakan/tree/master/Modules/Dictionary . There are explanations at the top of the unit files.

JWBDic is the dictionary file format itself (JWBIndex and JWBEdictMarkers are supporting modules), JWBDicSearch is high-level search over it. JWBDictionaries is the app-wide loaded dictionary collection. The rest are UI forms for lookups.

The even lower level parts (caching readers/writers, encodings, EDICT/CCEDICT formats) are in https://github.com/himselfv/jptools/tree/master/Share . Normally you don't need those as they rarely change and mostly just work.

Kana converters are here: https://github.com/himselfv/jptools/tree/master/KanaConv . It's here that a flag to skip the チュウ -> チュ- conversion is needed. But if I remember right, the way kanaconv works, it just follows the rules set by a roma file, so if the roma file says "chuu -> チュ-" (oversimplifying), it can't be changed by a flag.

But I guess another solution is to always convert romaji to hiragana in the Kanji search input (which gives us ちゅう), then convert hiragana->katakana and look up both "ちゅう" and "チュウ", and perhaps even "ちゅ-" and "チュ-" (by starting with katakana). KANJIDIC prefers "チュウ", but that's not to say other dictionaries won't have "チュ-" entries.
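Roughly, the candidate list could be built like this. This is a Python sketch under assumptions, not the Wakan code: it starts from the hiragana that the romaji conversion is assumed to have produced, the function names are made up for illustration, and only the う-after-u-vowel case is collapsed into the prolongation mark (ー, written "-" above).

```python
# Hiragana whose vowel is "u"; a following う is treated as a long vowel.
HIRAGANA_U_ROW = set("うくすつぬふむゆるぐずづぶぷゅ")


def hira_to_kata(text):
    """Shift hiragana U+3041..U+3096 up by 0x60 to the matching katakana.
    Everything else (including ー) is passed through unchanged."""
    return "".join(
        chr(ord(c) + 0x60) if "\u3041" <= c <= "\u3096" else c
        for c in text
    )


def with_prolongation_mark(hiragana):
    """Rewrite う after a u-vowel kana as the prolongation mark."""
    out = []
    for ch in hiragana:
        if ch == "う" and out and out[-1] in HIRAGANA_U_ROW:
            out.append("ー")
        else:
            out.append(ch)
    return "".join(out)


def search_candidates(hiragana):
    """All forms to try, full-vowel forms first."""
    forms = [hiragana, hira_to_kata(hiragana)]
    marked = with_prolongation_mark(hiragana)
    if marked != hiragana:
        forms += [marked, hira_to_kata(marked)]
    return forms


print(search_candidates("ちゅう"))
```

Putting the full-vowel forms first reflects the KANJIDIC preference mentioned above; the order only matters if the search stops at the first hit.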

Even better, is there maybe a standard way to type specifically a long vowel or a double vowel in romaji? Something like CHU U for a double vowel and CHUU for a long vowel.

himselfv commented 3 years ago

Implemented the above. We can produce an explicit "prolongation sign" in most romaji systems by typing "chu-". At least this approach won't break this and won't add チュウ results, which is good.
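To illustrate the "chuu" versus "chu-" distinction, here is a toy table-driven converter in Python. It is only a sketch: the four-entry `RULES` table and the greedy longest-match loop are assumptions made for the example, and the real roma files and KanaConv code use their own format and logic.

```python
# Toy rule table (hypothetical); a real roma file is far larger.
RULES = {
    "chu": "チュ",
    "ju": "ジュ",
    "u": "ウ",
    "-": "ー",  # explicit prolongation sign
}

MAX_RULE_LEN = max(len(key) for key in RULES)


def romaji_to_katakana(text):
    """Greedy longest-match conversion; unknown characters pass through."""
    out = []
    i = 0
    while i < len(text):
        for length in range(min(MAX_RULE_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in RULES:
                out.append(RULES[chunk])
                i += length
                break
        else:
            # No rule matched at this position: keep the character as is.
            out.append(text[i])
            i += 1
    return "".join(out)


print(romaji_to_katakana("chuu"))  # full-vowel form
print(romaji_to_katakana("chu-"))  # explicit prolongation sign
```

With a table like this, "chuu" keeps the full ウ and only a typed "-" yields the prolongation sign, so the two inputs never collide.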

KarolS commented 3 years ago

The problem is that 中 is listed as チュウ in the kanji dictionary, so I'd expect to find it if I typed "chuu". The only kanji that have a chouonpu in their yomi are weird obsolete gairaigo like 糎 (センチメートル; this is kunyomi btw, not onyomi as Wakan currently thinks).

As it is, even if I switch to a romanization (or more like, cyrillization) scheme that distinguishes between チュウ and チュ-, I still can't find 中 in the kanji search tab: I switched to Polivanov and neither тю: (colon is used to mark a long vowel) nor тюу finds anything.

himselfv commented 3 years ago

KarolS: I forgot to push the changes :) Sorry. If you're working from source, please try the master branch now.