unobliged / plymlet

plymlet rails test code
http://plymlet.herokuapp.com

Overhaul dictionary lookup for general purpose use #20

Open unobliged opened 12 years ago

unobliged commented 12 years ago

Right now the lookup only works for one language (Chinese). Update the lookup function, Redis, and the passage view to take the stated language of the passage into account and query the appropriate dictionary. Some thought will be needed on how to manage Redis with multiple dictionaries. Start with EDICT as a second dictionary and add a sample Japanese passage.
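One possible way to keep several dictionaries in a single Redis instance is to namespace each key by dictionary. The sketch below is only an illustration under assumptions: `DICTIONARY_FOR_LANGUAGE`, `dictionary_key`, `lookup`, and the `dict:<name>:<word>` key layout are hypothetical names, and `MemoryStore` is an in-memory stand-in so the example runs without a Redis server. A real Redis client exposing `get`/`set` could be passed in its place.

```ruby
# Hypothetical mapping from a passage's language to the dictionary that serves it.
DICTIONARY_FOR_LANGUAGE = {
  'chinese'  => 'cedict',
  'japanese' => 'edict'
}.freeze

# In-memory stand-in for a Redis client; only get/set are used here.
class MemoryStore
  def initialize
    @data = {}
  end

  def set(key, value)
    @data[key] = value
  end

  def get(key)
    @data[key]
  end
end

# Build a namespaced key such as "dict:edict:猫" so entries from
# different dictionaries never collide.
def dictionary_key(language, word)
  dictionary = DICTIONARY_FOR_LANGUAGE.fetch(language.to_s.downcase) do
    raise ArgumentError, "no dictionary configured for #{language.inspect}"
  end
  "dict:#{dictionary}:#{word}"
end

# Look a word up in the dictionary that matches the passage language.
def lookup(store, language, word)
  store.get(dictionary_key(language, word))
end

store = MemoryStore.new
store.set(dictionary_key('chinese', '猫'), 'cat (CEDICT entry)')
store.set(dictionary_key('japanese', '猫'), 'cat (EDICT entry)')
puts lookup(store, 'japanese', '猫')
```

The same character can exist in both dictionaries, which is why the dictionary name is part of the key rather than relying on the word alone.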

unobliged commented 12 years ago

Focus on:
- Sample Japanese passage
-- Update: http://www.jlptstudy.net/N5/index.html
- Getting EDICT and setting up a similar structure on Redis as with CEDICT
-- Think about whether this should be tied to the language attribute of Passage
-- Update: EDICT does not support romaji; a bridge script may be needed later, but for now just get the functionality working

For romanization see http://en.wikipedia.org/wiki/Romanization_of_Japanese (leaning towards Hepburn since that is what I learned).

unobliged commented 12 years ago

This may save some headaches since EDICT is in EUC-JP encoding: http://www.localizingjapan.com/blog/2012/07/16/japanese-encoding-conversion/

Update: converted the file to UTF-8; still need to create a new script to read it.

Update 2: the file is actually available in UTF-8 here: ftp://ftp.edrdg.org/pub/Nihongo/edict2u.gz
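For reference, the EUC-JP to UTF-8 conversion can also be done with Ruby's built-in string encoding support. This is a minimal sketch; `euc_jp_to_utf8` is a hypothetical helper name, and the sample bytes are generated in the script rather than read from the real EDICT file.

```ruby
# Convert a string holding EUC-JP bytes to UTF-8.
# Invalid or unmappable bytes are replaced instead of raising.
def euc_jp_to_utf8(bytes)
  bytes.dup.force_encoding(Encoding::EUC_JP)
       .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

# Simulate raw bytes as they would come out of an EUC-JP file.
raw = '日本語'.encode(Encoding::EUC_JP).force_encoding(Encoding::BINARY)
puts euc_jp_to_utf8(raw)
```

When reading a whole file, opening it with `File.open(path, 'r:EUC-JP:UTF-8')` asks Ruby to transcode each line on read, which avoids a separate conversion pass.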

unobliged commented 12 years ago

For now, I am bypassing Redis and saving directly from define_word_EDICT to Passage.vocab_list for Japanese passages (the rake task now does a language check). The EDICT define_word method needs some work: it is not as accurate as the CEDICT version because of the way data is arranged in the file. It might be worth modifying the file or storing it in a table for more precise key lookup, given that there are typically 3-6 keys that map to one definition (1-3 kanji, 1-3 kana). I will need to think it through more, but for demo purposes the sample Japanese passage view looks more like it should.
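To illustrate the "several keys map to one definition" problem, here is a rough sketch of indexing every kanji and kana key of an entry to the same record. It assumes the EDICT2 line layout `KANJI1;KANJI2 [KANA1;KANA2] /gloss/gloss/EntL.../` with optional parenthesized tags such as `(P)` on keys. `parse_edict_line` and `build_index` are hypothetical names, not the project's define_word_EDICT, and the sample lines are made up for the example with placeholder entry IDs.

```ruby
# Strip tags like "(P)" from each key in a ";"-separated list.
def clean_keys(part)
  part.to_s.split(';').map { |key| key.gsub(/\([^)]*\)/, '').strip }
      .reject(&:empty?)
end

# Parse one EDICT2-style line into kanji keys, kana keys, and glosses.
# Returns nil when the line does not look like an entry.
def parse_edict_line(line)
  head, *fields = line.strip.split('/')
  return nil if head.nil? || fields.empty?

  match = head.strip.match(/\A(?<first>[^\[\]]+?)\s*(?:\[(?<kana>[^\]]*)\])?\s*\z/)
  return nil unless match

  # Without a [kana] block, the headword itself is the kana key.
  kanji = match[:kana] ? clean_keys(match[:first]) : []
  kana  = clean_keys(match[:kana] || match[:first])
  glosses = fields.reject { |field| field.start_with?('EntL') }
  { kanji: kanji, kana: kana, glosses: glosses }
end

# Point every key (kanji or kana) at the shared entry record.
def build_index(lines)
  index = Hash.new { |hash, key| hash[key] = [] }
  lines.each do |line|
    entry = parse_edict_line(line)
    next unless entry
    (entry[:kanji] + entry[:kana]).each { |key| index[key] << entry }
  end
  index
end

# Illustrative lines only; the entry IDs are placeholders.
SAMPLE_LINES = [
  '食べる(P);喰べる [たべる(P)] /(v1,vt) to eat/EntL0000000X/',
  'あっと /(adv) in surprise/EntL0000001X/'
].freeze

index = build_index(SAMPLE_LINES)
puts index.fetch('たべる', []).map { |entry| entry[:glosses] }.inspect
```

Because each key points at the same entry object, a lookup by any kanji or kana form returns one definition rather than a partial match against the raw line.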

dpaola2 commented 12 years ago

You're kicking ass. Keep it up, Brian :-)

In reply to Brian's comment of Sep 18, 2012, 8:52 PM.

unobliged commented 12 years ago

Thanks! Still a long way to go, but slowly chipping away :p

In reply to Dave Paola's message of Wed, Sep 19, 2012, 12:22 AM.

unobliged commented 12 years ago

For adding words to the vocabulary list, there will need to be an attribute for the language; a migration will be needed, and support for this function needs to be extended to the User views. Currently I am working on how best to store dictionaries in the database. Most likely each one will have a separate table, to keep things simpler and to make updates/swaps more straightforward (this also removes the need for all words to have a language attribute).

unobliged commented 12 years ago

This may come in handy for narrowing down key search: http://stackoverflow.com/questions/3826918/how-to-classify-japanese-characters-as-either-kanji-or-kana
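A small sketch of that classification in Ruby, using Unicode script properties in regular expressions. The method names are hypothetical. Note the assumption that the prolonged sound mark `ー` should count as kana; it is listed explicitly because it is not matched by the Katakana script property.

```ruby
# Classify a single character as :kanji, :kana, or :other.
def classify_char(char)
  case char
  when /\p{Han}/ then :kanji
  # "ー" is handled explicitly since \p{Katakana} does not cover it.
  when /\p{Hiragana}/, /\p{Katakana}/, 'ー' then :kana
  else :other
  end
end

# True when every character of the word is hiragana or katakana.
def kana_only?(word)
  !word.empty? && word.each_char.all? { |char| classify_char(char) == :kana }
end

# True when the word contains at least one kanji.
def contains_kanji?(word)
  word.match?(/\p{Han}/)
end

%w[食べる たべる コーヒー].each do |word|
  puts "#{word}: kana_only=#{kana_only?(word)} contains_kanji=#{contains_kanji?(word)}"
end
```

With this, a kana-only word could be checked against kana keys alone, and a word containing kanji against kanji keys first.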

unobliged commented 12 years ago

It turns out array columns might be a better fit for Japanese, due to the way lookup will need to be structured and the many-to-many relationship between words and their associated information in EDICT2/JMDict. However, the gem for this isn't as advanced in functionality as hstore:

http://www.postgresql.org/docs/9.2/static/arrays.html
https://github.com/tlconnor/activerecord-postgres-array

I will stick with serialized array columns and see how that goes. CEDICT is probably the better candidate for hstore, so I will continue to use hstore there and learn from it.

unobliged commented 12 years ago

Minor update: storing dictionaries in the database will have to be held off due to the 10k row limit on the Heroku free tier. Each dictionary will definitely still need its own segmentation; see the edict2_parser app for ideas on dictionary schemas. Perhaps it would be better to develop a more efficient parser for the flat files and use those as a dumb database...
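One way the "flat file as a dumb database" idea could work is to scan the file once, remember the byte offset of each line per key, and then seek straight to matching lines on lookup. This is only a sketch under assumptions: `build_offset_index` and `lookup_lines` are hypothetical names, the line layout is assumed to be EDICT2-style (`KEYS [KANA] /gloss/`), and the demo writes a tiny temporary file instead of reading the real dictionary.

```ruby
require 'tempfile'

# Scan the file once and map each headword key to the byte offsets
# of the lines it appears on.
def build_offset_index(path)
  index = {}
  File.open(path, 'r:UTF-8') do |file|
    until file.eof?
      offset = file.pos
      head = file.readline.split('/').first.to_s
      head.scan(/[^\s;\[\]]+/).each do |raw_key|
        key = raw_key.gsub(/\([^)]*\)/, '') # drop tags such as "(P)"
        next if key.empty?
        (index[key] ||= []) << offset
      end
    end
  end
  index
end

# Seek directly to each recorded offset and return the matching lines.
def lookup_lines(path, index, word)
  offsets = index.fetch(word, [])
  File.open(path, 'r:UTF-8') do |file|
    offsets.map do |offset|
      file.seek(offset)
      file.readline.chomp
    end
  end
end

Tempfile.create(['edict_sample', '.txt']) do |tmp|
  tmp.write("食べる [たべる] /(v1,vt) to eat/\n")
  tmp.write("猫 [ねこ] /(n) cat/\n")
  tmp.flush
  index = build_offset_index(tmp.path)
  puts lookup_lines(tmp.path, index, 'ねこ')
end
```

Only the key-to-offset index has to stay in memory (or in Redis); the definitions themselves stay in the file, which sidesteps the row limit.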