himselfv / wakan

Japanese and Chinese learning tool with dictionary

Faster Chinese deflexion lookups #135

Open himselfv opened 11 years ago

himselfv commented 11 years ago

Original report by me.

Originally reported on Google Code with ID 135

Every pinyin syllable has a tone, e.g. pin3yin4. Words with different tones are different
words, each with its own dictionary entry.

When doing dictionary lookups, people usually omit tones and simply type "pinyin".
As part of handling input, Wakan guesses the pinyin syllables and appends 0 to
each as a tone placeholder:
  pin0yin0

Then, when generating possible DB lookups, it substitutes every combination of tones
for the 0 placeholders:
  pin1yin1
  pin1yin2
  //...
  pin4yin4

This is slow: the number of lookups grows exponentially with the number of syllables.
With four tones per syllable, as in the listing above, n syllables produce 4^n lookups.
It also doesn't work when the string can't be fully parsed as pinyin (contains Latin
symbols), because Wakan cannot distinguish pinyin from Latin letters that resemble it.
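
To make the growth concrete, here is a minimal Python sketch of the expansion described above. It is an illustration only, not Wakan's actual code: `expand_tones` and `TONES` are hypothetical names, and the four-tone set is an assumption taken from the pin1yin1 .. pin4yin4 listing.

```python
from itertools import product

# Assumption: four tones per syllable, matching the pin1yin1 .. pin4yin4 listing.
TONES = "1234"

def expand_tones(syllables):
    """Yield every tone assignment for a list of toneless syllables."""
    for combo in product(TONES, repeat=len(syllables)):
        yield "".join(syl + tone for syl, tone in zip(syllables, combo))

# Two syllables already mean 4 ** 2 = 16 separate DB lookups.
candidates = list(expand_tones(["pin", "yin"]))

# Five syllables mean 4 ** 5 = 1024 lookups.
five_syllable_count = sum(1 for _ in expand_tones(["a"] * 5))
```

Each generated string corresponds to one DB lookup, so every extra syllable multiplies the work by four.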

We have to find a better way of doing these lookups.

One suggestion, in the same vein as other similar tasks (listing all matches for
unparseable Latin+pinyin input), is to store the pinyin signature in the DB without
tones at all. Then we'll get all matches with a single lookup.
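
A rough Python sketch of the toneless-signature idea, under stated assumptions: the index is a plain in-memory dict rather than Wakan's dictionary format, `strip_tones` and `build_index` are hypothetical helpers, and the entry keys are sample data chosen for illustration.

```python
import re
from collections import defaultdict

def strip_tones(romaji):
    """Drop tone digits from a romaji key: 'pin1yin1' -> 'pinyin'."""
    return re.sub(r"[0-9]", "", romaji)

def build_index(entries):
    """Map each toneless signature to all toned entries that share it."""
    index = defaultdict(list)
    for romaji in entries:
        index[strip_tones(romaji)].append(romaji)
    return index

# Sample keys for illustration only.
entries = ["ma1", "ma3", "ma4", "mai3", "ma3shang4"]
index = build_index(entries)

# One lookup on the toneless signature returns every tone variant at once.
matches = index.get(strip_tones("ma"), [])
```

The trade-off is that the lookup returns all tone variants, so any tones the user did type have to be checked afterwards.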
Drawback: if the user does specify tones explicitly, we'll have to do
  CompareStr(StripPunctuation(userInput), StripPunctuation(KanaToRomaji(matchKana)))
for every match, because we need to compare against the tonified romaji, which can only
be produced from kana.
It's even worse if the user specifies tones for only *some* of the syllables and the
string is unparseable (contains Latin). I see no way to match it against anything, so
this is an obstacle.
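
For the fully parseable case only, the per-match comparison could treat 0 as a wildcard tone. The Python sketch below is an assumption-laden illustration, not the Delphi comparison quoted above: `split_toned` and `tones_match` are hypothetical, and both strings are assumed to be already segmented into syllable+digit pairs, with 0 marking an unspecified tone. It does not address the unparseable Latin+pinyin case, which stays open.

```python
import re

# Assumption: input is already segmented as syllable + tone digit, 0 = unspecified.
SYLLABLE = re.compile(r"([a-z]+)([0-5])")

def split_toned(romaji):
    """Split 'pin0yin4' into [('pin', '0'), ('yin', '4')], or None if unparseable."""
    parts = SYLLABLE.findall(romaji)
    if "".join(syl + tone for syl, tone in parts) != romaji:
        return None  # leftover characters: not fully parseable
    return parts

def tones_match(user_input, candidate):
    """True if the candidate agrees with every tone the user actually specified."""
    user = split_toned(user_input)
    cand = split_toned(candidate)
    if user is None or cand is None or len(user) != len(cand):
        return False
    return all(
        u_syl == c_syl and u_tone in ("0", c_tone)
        for (u_syl, u_tone), (c_syl, c_tone) in zip(user, cand)
    )

# The user typed a tone only for the second syllable.
kept = [c for c in ["pin1yin1", "pin1yin4", "pin3yin4"] if tones_match("pin0yin4", c)]
```

This filter runs once per match returned by the toneless lookup, so its cost is linear in the number of matches rather than exponential in the number of syllables.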

Low priority because lookups are working as they are and this is not blocking anything.

Reported by himselfv on 2013-03-27 08:46:53

himselfv commented 11 years ago
Another drawback: if we change the romaji signature field too much, existing dictionaries
will stop working.

Reported by himselfv on 2013-03-27 08:52:16

himselfv commented 11 years ago
To restate: at this time kanji/kana requests deflex and pinyin requests mostly deflex;
only direct requests that mix pinyin and Latin will not deflex.
Given that there are fewer than 100 such records in total, this is a non-breaking bug.

Reported by himselfv on 2013-03-27 11:03:18

himselfv commented 11 years ago

Reported by himselfv on 2013-04-23 11:43:48