seth-js / yomichan-ru

A Russian hover dictionary. It's a modified version of Yomichan that works with Russian.

Guide for creating yomichan dictionaries for other languages #1

Closed by HIKORIN01 1 year ago

HIKORIN01 commented 1 year ago

First of all, I'd like to say thank you. Your work has been extremely useful, and I appreciate you posting this for the world to see. If possible, would you be able to make a step-by-step guide for those who want to make these for themselves? (For example, I need a Spanish and a French version.) Thank you again!

seth-js commented 1 year ago

Thanks for the kind words. I don't think I'll be able to provide a guide because it's a complicated process and it can change greatly depending on the language.

Each language I work with requires me to:

On a positive note, I've had yomichan-es and yomichan-fr working well for about 2 months now. When I have some time, I'll try to set up the repositories for them.

Also, where did you find this repository? I haven't exactly been that public about it.

HIKORIN01 commented 1 year ago

I was so desperate for Yomichan in other languages that I was literally checking the newest GitHub releases that included the word 'yomichan' daily, lol

Lyroxide commented 1 year ago

@seth-js Sorry for hijacking. I am currently making Yomichan dictionaries for Korean, so I am very curious: what exactly did you modify in Yomichan? It seems like the deinflected word is shown automatically when the word is inflected. I checked your term bank JSON files, and it seems like you used the same method (duplicating the definition for each inflection). How did you code the deinflection?

seth-js commented 1 year ago

@HIKORIN01 The repositories should be public now. I haven't verified that I uploaded everything correctly, so let me know if there are any problems.

https://github.com/seth-js/yomichan-es
https://github.com/seth-js/yomichan-fr

@Lyroxide

as to what exactly did you modify Yomichan? How did you code the deinflection?

You can read the Yomichan source code in either yomichan-es or yomichan-fr. These have the latest features and fixes. Do a global search for "Custom edits"; you should get 13 results across the different files I modified.

The deinflection part is handled in yomichan/js/language/translator.js.

It goes over each result and checks whether the result has a special tag named non-lemma. If it has the tag, it looks through each definition. A non-lemma definition will be something like verb {sue -> savoir} feminine singular past participle of savoir (->savoir). Using a regular expression, I take out the lemma (savoir) and add it to an object called requiredSearches. It then runs a loop that keeps looking things up until there are no more results with the non-lemma tag. This is needed because a non-lemma form will sometimes point to another non-lemma form. I noticed this happening a lot in Russian (just think of the genitive case of a diminutive form of a lemma), and I finally fixed it once I made a dictionary for Swedish (e.g. vises -> vise -> vis).
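
For anyone who wants to see the shape of that loop, here is a minimal sketch in Python. It is not the actual code from translator.js: the dictionary layout, the function name, the `max_rounds` safety limit, and the sample Swedish entries are all made up for illustration, and the regular expression is only a guess based on the `(->savoir)` pattern shown above.

```python
import re

# Assumed pattern: a non-lemma definition ends with "(->lemma)", as in the
# "(->savoir)" example above. The real expression in translator.js may differ.
LEMMA_POINTER = re.compile(r"\(->\s*([^)]+?)\s*\)")

# Hypothetical in-memory dictionary: term -> list of entries.
# The definitions are placeholders, not real Wiktionary data.
DICTIONARY = {
    "vises": [{"tags": ["non-lemma"], "definitions": ["form of vise (->vise)"]}],
    "vise": [{"tags": ["non-lemma"], "definitions": ["form of vis (->vis)"]}],
    "vis": [{"tags": ["adjective"], "definitions": ["placeholder lemma definition"]}],
}


def lookup_with_lemmas(term, dictionary, max_rounds=10):
    """Look up `term`, then keep following non-lemma pointers until none remain."""
    results = []                 # (looked-up word, entry) pairs, in lookup order
    seen = set()                 # words already looked up, to avoid cycles
    required_searches = [term]
    rounds = 0
    while required_searches and rounds < max_rounds:
        rounds += 1
        next_searches = []
        for word in required_searches:
            if word in seen:
                continue
            seen.add(word)
            for entry in dictionary.get(word, []):
                results.append((word, entry))
                if "non-lemma" not in entry["tags"]:
                    continue
                # Pull every "(->lemma)" pointer out of the definitions.
                for definition in entry["definitions"]:
                    for lemma in LEMMA_POINTER.findall(definition):
                        if lemma not in seen:
                            next_searches.append(lemma)
        required_searches = next_searches
    return results


visited = [word for word, _ in lookup_with_lemmas("vises", DICTIONARY)]
print(visited)  # ['vises', 'vise', 'vis']
```

The `seen` set and the round limit are additions for the sketch so that two forms pointing at each other cannot loop forever.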

Korean is actually one of the next languages I was thinking about trying to make something like this for.

Let me know if you still need help, I'll be happy to clarify more or answer questions.

seth-js commented 1 year ago

I just checked out the Wiktionary dump for Korean. Unfortunately, it looks like a massive number of possible conjugations aren't covered. Suffixes break lookups for verbs and adjectives. Making a dictionary for Korean that handles inflected words is way more complicated than anything I've seen before, even more than Russian or Japanese. It doesn't help that digital Korean dictionaries (offline JSONs) are extremely lacking compared to other languages.

I'll be avoiding Korean dictionary development for now.

KamWithK commented 1 year ago

Not sure if it helps but I scraped a few Korean dictionaries recently: https://github.com/KamWithK/KrDictionaries

Maybe you can just use krdict plus a Korean conjugation/deconjugation NLP library instead of manually writing out the conjugations?

seth-js commented 1 year ago

@KamWithK

a korean conjugation/deconjugation nlp library

Any recommendations? One of the first language projects I did a couple years back was an Electron Polish hover dictionary. I was using spaCy to lemmatize words, but it was painfully slow. There was also no information telling me about the case of the word, the gender, the tense, etc. Is it possible to generate a conjugation/declension table from a lemma now?

It's been bugging me that I gave up so quickly. Especially since Korean Wiktionary already has over a hundred thousand possible inflections in there. Just take a look at any verb conjugation table. I'm noticing that adverbs were for some reason completely ignored in Wiktionary (크다 -> 크게). If I take some time to read a Korean grammar book, I might just manually set up conditions where if the word has an adjective POS, then set up a custom form to cover the adverb. There would of course be other suffixes I'd have to handle (-히, -으로). Another problem I noticed was the form "기억하라" of the lemma "기억하다" is missing from Wiktionary. The imperative suffix "-라" will also need to be handled.
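
A rough sketch of what those manual conditions could look like, in Python. Everything here is an assumption for illustration: the function name, the POS strings, and the form labels are invented, and the rules only cover the two cases mentioned above (-게 on adjectives, -라 on verbs whose stem ends in a vowel). Stems ending in a consonant, irregular verbs, and the -히 and -으로 suffixes are deliberately left out.

```python
HANGUL_BASE = 0xAC00   # first precomposed Hangul syllable (가)
HANGUL_LAST = 0xD7A3   # last precomposed Hangul syllable (힣)


def has_final_consonant(syllable):
    """Return True if a precomposed Hangul syllable has a final consonant (batchim)."""
    code = ord(syllable)
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        raise ValueError(f"not a Hangul syllable: {syllable!r}")
    # Each lead+vowel pair owns 28 consecutive code points; offset 0 means no final.
    return (code - HANGUL_BASE) % 28 != 0


def extra_forms(lemma, pos):
    """Generate custom forms missing from the dump, mapped to a short label."""
    if len(lemma) < 2 or not lemma.endswith("다"):
        return {}
    stem = lemma[:-1]
    forms = {}
    if pos == "adjective":
        # 크다 -> 크게
        forms[stem + "게"] = "adverbial form (-게)"
    if pos == "verb" and not has_final_consonant(stem[-1]):
        # 기억하다 -> 기억하라 (vowel-final stems only in this sketch)
        forms[stem + "라"] = "imperative (-라)"
    return forms


print(extra_forms("크다", "adjective"))   # {'크게': 'adverbial form (-게)'}
print(extra_forms("기억하다", "verb"))     # {'기억하라': 'imperative (-라)'}
```

Each generated form could then be written out as a non-lemma entry pointing back at the lemma, the same way the Wiktionary-derived forms are.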

This looks like a decent amount of work if I want non-lemmas to point to lemmas and get information about what the inflection signifies.

Lyroxide commented 1 year ago

You can check out my repo. I actually used https://github.com/Kyubyong/KoParadigm to generate all the possible inflections. I then duplicated entries if the character before 다 does not match the conjugations.

The only work left to do is to point the non-lemmas to lemmas. Because of the sheer number of homonyms, that is difficult, so for the time being it is probably best to let the user do a manual second look-up for the lemma.

Another way is to use KoParadigm to generate all the possible inflections, and check if the text scanned is one of those inflected words and return the original verb. This is also not practical because there are at least 70,000 * 100 words to go through.

Perhaps recreating KoParadigm inside Yomichan would be wiser. You can read their paper on how they tackled this. Basically, they decomposed the word into individual jamo and appended endings according to their verb classes.
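
To give an idea of the jamo step, here is a small Python sketch that splits a precomposed Hangul syllable into its jamo and puts it back together using the standard Unicode arithmetic. This is not KoParadigm's code, and it does not model verb classes at all; the function names are made up, and `attach_final` is just one illustrative operation (adding a final consonant to an open stem syllable).

```python
HANGUL_BASE = 0xAC00
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"        # 21 vowels
JONGSEONG = ["", *"ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"]  # 28 finals, "" = none


def decompose(syllable):
    """Split one precomposed Hangul syllable into (lead, vowel, tail)."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < 19 * 21 * 28:
        raise ValueError(f"not a Hangul syllable: {syllable!r}")
    lead, rest = divmod(index, 21 * 28)
    vowel, tail = divmod(rest, 28)
    return CHOSEONG[lead], JUNGSEONG[vowel], JONGSEONG[tail]


def compose(lead, vowel, tail=""):
    """Build a precomposed Hangul syllable from its jamo."""
    index = (CHOSEONG.index(lead) * 21 + JUNGSEONG.index(vowel)) * 28
    return chr(HANGUL_BASE + index + JONGSEONG.index(tail))


def attach_final(stem, jamo):
    """Attach a final consonant to the last syllable of a stem that has none."""
    lead, vowel, tail = decompose(stem[-1])
    if tail:
        raise ValueError("last syllable already has a final consonant")
    return stem[:-1] + compose(lead, vowel, jamo)


print(decompose("한"))          # ('ㅎ', 'ㅏ', 'ㄴ')
print(attach_final("크", "ㄴ"))  # 큰
```

Working at the jamo level like this is what makes it possible to attach endings that merge into the stem's last syllable instead of being appended as a separate character.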

KamWithK commented 1 year ago

This is the first result which comes up when I search for Korean NLP: https://konlpy.org

You'd expect spaCy to work well, but the models it uses are neural networks, aren't they? So of course it's going to run a lot slower (though if they were trained well, it'll probably give good results). I haven't actually used it recently though, so I'm not 100% sure.