jmdict-kindle / jmdict-kindle

Japanese - English dictionary for Kindle based on the JMdict / EDICT database

Add Pitch information from https://github.com/mifunetoshiro/kanjium #8

Closed mymro closed 2 years ago

mymro commented 4 years ago

This dictionary currently uses pitch information from a single source. mifunetoshiro/kanjium has pitch information for around 125,000 words. Adding it would dramatically increase the number of entries with accent data.

gitdubblub commented 3 years ago

The Japanese learning community on the Refold Japanese Discord would greatly appreciate pitch accent information on the Paperwhite. Any hope of integration?

mymro commented 2 years ago

@gitdubblub Pitch accent information is already present in newer releases of the dictionary. Maybe I will look into integrating the additional information around Christmas.

gitdubblub commented 2 years ago

The community is becoming more aware of the importance of pitch, so we are still looking forward to pitch being added to the Kindle dictionary. Thanks for not dropping the project.

On Thursday, 20 January 2022 at 10:39, José Fonseca wrote:

Closed #8 https://github.com/jrfonseca/jmdict-kindle/issues/8 via #20 https://github.com/jrfonseca/jmdict-kindle/pull/20.


mymro commented 2 years ago

@gitdubblub Pitch information was already present in the dictionary, and the additional pitch information has now been added. Please download the newest version here.

For information on how the pitch information is encoded, please see the README. Please note that the combined dictionary (combined.mobi) does not contain pitch information and some other features due to size constraints.

gitdubblub commented 2 years ago


Thank you so much, mymro! Indeed, the jmdict.mobi file has pitch information, but combined.mobi does not. Maybe you can update that too?

mymro commented 2 years ago

@gitdubblub Unfortunately, adding this information to the combined dictionary does not currently appear to be possible. There is a hard limit on how big a mobi file can get, and adding pronunciation would push the file past that limit.

jrfonseca commented 2 years ago

@mymro is there some database of Japanese word frequency? One could use such a frequency database to create a smaller combined dictionary, with all the bells and whistles, for the most frequent words only, so that it stays under the size limit. That seems the best way to have our cake and eat it too...
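A rough sketch of that idea, nothing more: it assumes each entry carries a frequency rank and an estimated size in bytes. `Entry`, `rank`, `size_bytes`, and `select_entries` are made-up names for illustration and are not taken from the jmdict-kindle code.

```python
from typing import Iterable, List, NamedTuple


class Entry(NamedTuple):
    headword: str
    rank: int        # hypothetical frequency rank, 1 = most frequent
    size_bytes: int  # hypothetical estimate of the entry's size in the output


def select_entries(entries: Iterable[Entry], budget_bytes: int) -> List[Entry]:
    """Keep the most frequent entries until the next one would exceed the budget."""
    selected: List[Entry] = []
    used = 0
    for entry in sorted(entries, key=lambda e: e.rank):
        if used + entry.size_bytes > budget_bytes:
            # Stop at the first entry that does not fit, so the result
            # is a clean frequency cutoff rather than a patchwork.
            break
        selected.append(entry)
        used += entry.size_bytes
    return selected


if __name__ == "__main__":
    sample = [
        Entry("稀有", rank=3, size_bytes=500),
        Entry("食べる", rank=1, size_bytes=400),
        Entry("猫", rank=2, size_bytes=300),
    ]
    for entry in select_entries(sample, budget_bytes=800):
        print(entry.headword)
```

The real size limit applies to the finished mobi file, so the per-entry estimate would only be a stand-in for it.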

PMF-Constantin commented 2 years ago

@jrfonseca There is frequency data in this repo, and you can probably find more elsewhere.

mymro commented 2 years ago

@jrfonseca I am not a fan of taking words out of the dictionary. You will always have some arbitrary cutoff point. The lower the cutoff point, the less likely it is that a given word will be missing. However, there is always a possibility that a word will not be found because of the removed data.

In my opinion, we should get rid of the combined dictionary. It is not really useful for reading anyway. When you look up words, very often the first thing you see is a name rather than a word. You then have to tap through multiple pages until you find the entry you are actually looking for. Also, you will inevitably remember the few names you do see in books at some point.

jrfonseca commented 2 years ago

Before digital dictionaries were a thing, it was common for dictionaries to come in different sizes: some pocket-sized, some in multiple volumes. That is still true today if one buys paper dictionaries. It's a perfectly acceptable trade-off between convenience and completeness.

I haven't used these dictionaries myself in a while, so I don't have an opinion about the combined dictionary. I'm happy to be guided by you.

That said, it does seem that some users tend to use the combined dictionary. I don't know if it's a true preference or merely the result of misleading naming, as out of the three mobi files available for download, combined does sound the best. So, even if we keep the combined dict, maybe we should rename it to make it clear that it's not the exact combination of those two, e.g., combined_jmdict_jmndict_nopitch_fewerexamples.

Or we add a "Which dictionary should I use?" table to the README, with a side-by-side comparison of the three dictionaries, so folks can choose.

mymro commented 2 years ago

@jrfonseca I am in favour of renaming the combined dictionary. We cannot rely on people reading the README. I also thought about splitting the combined dictionary, but I do not have frequency information for names, and those would have to be split too. They make up the majority of entries in the combined dictionary. Of course, we could always put all names together with the most frequent words in part one and the remaining words in part two. However, having a second part without names does not seem right to me.
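For illustration only, the two-part split described above could look roughly like this. It assumes words come with a frequency rank and names have none; `split_combined` and its parameters are hypothetical and do not exist in the project.

```python
from typing import List, Sequence, Tuple


def split_combined(
    names: Sequence[str],
    ranked_words: Sequence[Tuple[str, int]],  # (headword, frequency rank), 1 = most frequent
    part_one_word_count: int,
) -> Tuple[List[str], List[str]]:
    """Part one: all names plus the most frequent words. Part two: the remaining words."""
    by_rank = sorted(ranked_words, key=lambda pair: pair[1])
    frequent = [word for word, _ in by_rank[:part_one_word_count]]
    rest = [word for word, _ in by_rank[part_one_word_count:]]
    # Names cannot be split because no frequency data is available for them.
    return list(names) + frequent, rest


if __name__ == "__main__":
    part_one, part_two = split_combined(
        names=["田中"],
        ranked_words=[("食べる", 1), ("稀有", 3), ("猫", 2)],
        part_one_word_count=2,
    )
    print(part_one)
    print(part_two)
```

This makes the drawback visible: part two ends up containing words only, with no names at all.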

gitdubblub commented 2 years ago

Thanks man, it's such a blessing to finally have a Japanese-English dictionary with pitch info for the Kindle Paperwhite! I'm really loving it.
