himselfv / wakan

Japanese and Chinese learning tool with dictionary

Support for punctuation in dictionaries, user input #136

Closed: himselfv closed 11 years ago

himselfv commented 11 years ago

Original report by me.

Originally reported on Google Code with ID 136

Dictionaries (both EDICT/2 and C/CCEDICT) contain punctuation marks in some records.
The most common are:
  ・ in EDICT and · in CEDICT to separate name from surname
  、 in EDICT and , in CEDICT in proverbs and idioms
  and some others

Until now, records with punctuation were either not imported at all or had their
punctuation silently stripped.

We need a consistent strategy for dealing with punctuation: how to import it, how to
store it, and how to handle punctuation coming from the user (in lookups).

Reported by himselfv on 2013-03-27 09:00:31

himselfv commented 11 years ago
Currently chosen solution:
1. Preserve punctuation when converting Romaji/Pinyin<->Kana/Bopomofo.
2. Store kanji and kana with punctuation in the db.
 Direct kanji and kana lookups must therefore include the punctuation.
3. Strip punctuation from the romaji signature.
 Also strip punctuation from all user input in romaji.

Direct romaji lookup: strip punctuation and search by roma.
Deflexed romaji lookup (requires clean roma): convert to kana, produce deflexions, then
look each one up. (Note that this case almost never arises: punctuation occurs mostly in
names and idioms, which the current deflexion rules do not apply to.)
Direct kana/kanji lookup: look up the text as is.
Deflexed kana/kanji lookup: no change.
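
To illustrate the direct lookups above, here is a minimal sketch in Python. It is not
Wakan's actual code: `ToyDictionary`, `strip_punctuation`, the `PUNCTUATION` set and the
sample entry are all hypothetical. It assumes kana is stored verbatim and the romaji
signature is stored with punctuation removed.

```python
# Illustrative punctuation set; the issue names these marks "and some others".
PUNCTUATION = frozenset("・·、,")


def strip_punctuation(text: str) -> str:
    """Remove every character listed in PUNCTUATION."""
    return "".join(ch for ch in text if ch not in PUNCTUATION)


class ToyDictionary:
    """Hypothetical in-memory stand-in for the dictionary database."""

    def __init__(self) -> None:
        self.by_kana: dict[str, list[str]] = {}
        self.by_roma: dict[str, list[str]] = {}

    def add(self, kana: str, romaji: str, meaning: str) -> None:
        # Kana is stored verbatim, punctuation included.
        self.by_kana.setdefault(kana, []).append(meaning)
        # The romaji signature is stored without punctuation.
        self.by_roma.setdefault(strip_punctuation(romaji), []).append(meaning)

    def lookup_kana(self, kana: str) -> list[str]:
        # Direct kana lookup: the text must match exactly.
        return self.by_kana.get(kana, [])

    def lookup_romaji(self, user_input: str) -> list[str]:
        # Direct romaji lookup: strip punctuation from user input first.
        return self.by_roma.get(strip_punctuation(user_input), [])


d = ToyDictionary()
d.add("ジョン・スミス", "jon·sumisu", "John Smith (made-up entry)")
```

With this sketch, `d.lookup_romaji("jon,sumisu")` and `d.lookup_romaji("jonsumisu")` both
find the entry, while `d.lookup_kana("ジョンスミス")` (no ・) finds nothing.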

Reported by himselfv on 2013-03-27 10:05:38

himselfv commented 11 years ago
Problem: some punctuation exists only in Unicode, while romaji is always ANSI (it is
stored as ANSI in the db).
Solution: since we only support an explicit set of punctuation marks, when converting to
romaji just replace Unicode commas etc. with their ANSI equivalents.
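
A rough sketch of that replacement in Python, using the standard `str.maketrans` and
`str.translate`. The mapping table is an assumption made for illustration: the issue
does not list which marks are supported or what each one maps to, so
`UNICODE_TO_ANSI` and `to_ansi_punctuation` are hypothetical names with a deliberately
small table.

```python
# Hypothetical mapping from Unicode punctuation to ANSI/ASCII equivalents.
# The real set of supported marks is not given in the issue.
UNICODE_TO_ANSI = str.maketrans({
    "、": ",",  # U+3001 ideographic comma
    "，": ",",  # U+FF0C fullwidth comma
    "。": ".",  # U+3002 ideographic full stop
    "．": ".",  # U+FF0E fullwidth full stop
})


def to_ansi_punctuation(text: str) -> str:
    """Replace the known Unicode punctuation marks with their ANSI versions."""
    return text.translate(UNICODE_TO_ANSI)
```

For example, `to_ansi_punctuation("a、b。")` gives `"a,b."`; characters outside the
table pass through unchanged.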

Reported by himselfv on 2013-03-27 10:33:41

himselfv commented 11 years ago
A trick we don't use now but could employ in the future:

If we need to search for kana with some leeway in what we accept (as with roma
lookups), we can:
 1. Deflex properly while still in kana.
 2. Convert all lookups to roma (punctuation-less).
 3. For every result, check that the source kana satisfies us (for instance, that it
has the appropriate punctuation).

We already do something similar when looking for kana-only words; this lets us fetch
results even when they are written in a different kana (e.g. ヒク for a ひく lookup).
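
A toy Python sketch of steps 2 and 3 (deflexion is left out). Everything here is
hypothetical: `KANA_TO_ROMA` covers only the four kana used in the example, and
`to_roma`, `punctuation_of` and `loose_lookup` are illustrative names, not Wakan
functions.

```python
# Tiny illustrative table; a real converter would cover the full syllabary.
KANA_TO_ROMA = {"ひ": "hi", "く": "ku", "ヒ": "hi", "ク": "ku"}
PUNCTUATION = frozenset("・、")


def to_roma(kana: str) -> str:
    """Convert kana to punctuation-less roma; unknown characters pass through."""
    return "".join(KANA_TO_ROMA.get(ch, ch) for ch in kana if ch not in PUNCTUATION)


def punctuation_of(text: str) -> list[str]:
    """Return the punctuation marks of the text, in order."""
    return [ch for ch in text if ch in PUNCTUATION]


def loose_lookup(entries: list[str], query: str, accept) -> list[str]:
    """Match entries by roma, then keep those whose source kana passes accept()."""
    key = to_roma(query)
    return [e for e in entries if to_roma(e) == key and accept(e)]


entries = ["ひく", "ヒク", "ひ・く"]
query = "ひく"

# Accept any kana script, but require the same punctuation as the query.
same_punct = loose_lookup(
    entries, query, lambda e: punctuation_of(e) == punctuation_of(query)
)
# Accept everything that matches by roma.
anything = loose_lookup(entries, query, lambda e: True)
```

Here `same_punct` holds ひく and ヒク, while `anything` also includes ひ・く. The
`accept` callback is where the leeway is decided.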

Reported by himselfv on 2013-03-27 10:37:49

himselfv commented 11 years ago
I think this was solved at some point. Why is it not closed?

Reported by himselfv on 2013-04-10 13:38:45