seth-js / yomichan-es

A Spanish hover dictionary. It's a modified version of Yomichan that works with Spanish (Castilian).

How long does it take to create a dictionary like that? #3

Closed RyanOrigens closed 1 year ago

RyanOrigens commented 1 year ago

I saw that the time between your Spanish and French dictionaries was approximately 2 days. How long would it take to make one for another Latin language, given the necessary knowledge of that language? What about a language you have no previous knowledge of, like Korean? I found a Korean dictionary for Yomichan on GitHub (https://github.com/Samuihasu/krdict-yomichan), but it's not as good as yours. I'm not a developer, and I would like to make a dictionary like yours for Korean. Could you please clarify the process you used to get to this marvelous result? Thanks.

anewstheart commented 1 year ago

I would also love to know your dictionary creation process and tools. I am a developer, and I recently started making a Croatian Yomichan dictionary from the Wiktionary data set. Thanks!

seth-js commented 1 year ago

@RyanOrigens

how long would it take to make another latin language with previous knowledge of the language necessaries

How long it takes to create a dictionary depends on whether the data dump is free of problems, and whether I already have code for a language that works similarly. For example, going from Spanish to French was extremely simple because the French dump had no problems and French works similarly to Spanish grammatically. Russian took a long time because there were multiple problems I had to solve, such as native speakers consistently typing "ё" as "е". German also took a while because the German dump contains an insanely large amount of garbage data that I had to write extra code to filter out.

what about a language that you have no previous knowledge about like "Korean" i found a dictionary to korean on github

I talked with some people about this almost 4 months ago here. Unfortunately, Korean is one of the most complicated languages I've seen so far. The data dump is missing basic stuff like adverbs, and important suffixes. If I were to get started on yomichan-ko, I'd have to start studying Korean myself and read through a grammar book to make sure I somehow cover all possible inflections. Parts of Yomichan's code would have to be rewritten to handle these custom inflection rules. Lyroxide is going to try to make a deconjugator, and that is going to be an insane amount of work.

@anewstheart

I would also love to know your dictionary creation process and tools

I change the code every once in a while, and the code can vary depending on the language. I'm just using Node.js and working with Kaikki data dumps to create a specially formatted set of .json files that then work with custom code I've written for Yomichan itself. I talked a bit about my Yomichan edits here.

I recently started with making a Croatian Yomichan dictionary from the Wiktionary data set

I went ahead and tried making one myself.

It took about 4 hours for me to get a decent result for Serbo-Croatian. I started with the German code and made edits from there.

There were two issues that required me to add custom code:

  1. Some forms don't have proper glosses.

For example, "nove" has all the definitions as one string.

  2. Noun gender tags are missing.

The "tags" key in each sense never has gender info like other languages would. The word "biće" should have "neuter" in its senses' tags, but instead you need to get the gender through the head_templates' expansion string.
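A minimal sketch of that kind of fallback, assuming each entry has a senses array and a head_templates array whose items carry an expansion string. The function name genderFromEntry, the abbreviation table, and the sample expansion "biće n" are illustrative assumptions, not the code actually used for the dictionary:

```javascript
// Hypothetical helper: the names and the sample data below are assumptions.
const GENDER_ABBREVIATIONS = { m: "masculine", f: "feminine", n: "neuter" };

function genderFromEntry(entry) {
  // Prefer a gender tag on a sense when the dump provides one.
  for (const sense of entry.senses ?? []) {
    for (const tag of sense.tags ?? []) {
      if (Object.values(GENDER_ABBREVIATIONS).includes(tag)) return tag;
    }
  }

  // Otherwise fall back to the head template's expansion string.
  // Assumed shape: "<headword> <m|f|n> ...". Single-word headwords only.
  for (const template of entry.head_templates ?? []) {
    const match = /^\S+\s+([mfn])(?=\s|$)/.exec(template.expansion ?? "");
    if (match) return GENDER_ABBREVIATIONS[match[1]];
  }

  return null; // gender could not be determined
}

// Toy entry with an assumed expansion string.
const sampleEntry = {
  word: "biće",
  senses: [{ tags: [] }],
  head_templates: [{ expansion: "biće n" }],
};

console.log(genderFromEntry(sampleEntry)); // logs neuter
```

The regex is deliberately strict, so an expansion in an unexpected shape yields null rather than a wrong gender.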

Both issues can also be solved if I create an issue on wiktextract to let the developer know there are some problems with the dump.

I'll try releasing it soon if you'd like. It's pretty much completely untested, so it would be nice to get some feedback.

RyanOrigens commented 1 year ago

thanks for your response.

seth-js commented 1 year ago

@RyanOrigens

No problem. I hope you're able to get a decent Yomichan setup for Korean soon. It's a cool language, and I hope Lyroxide's project goes well.

I forgot to mention it, but ChatGPT also gives excellent help for breaking down Korean sentences. You can just give it a prompt like "Translate and break down: 떡국을 먹으면 나이가 한 살 는다."

You can also ask follow-up questions about its response. Right now ChatGPT is free to use on OpenAI's site, but once that goes away, it should still be freely accessible through Poe.

anewstheart commented 1 year ago

@seth-js

Thank you so much for your comprehensive answer and for the wonderful service you are providing. Apologies for missing the related issue on yomichan-ru. It was the one repository I didn't check the issues for.

I went ahead and tried making one myself.

It took about 4 hours for me to get a decent result for Serbo-Croatian. I started with the German code and made edits from there.

What! That is amazing! Thanks!

I'll try releasing it soon if you'd like. It's pretty much completely untested, so it would be nice to get some feedback.

I would be thrilled to help you test it and release it. I am ready anytime.

There were two issues that required me to add custom code:

Some forms don't have proper glosses.

For example, "nove" has all the definitions as one string.

Noun gender tags are missing.

The "tags" key in each sense never has gender info like other languages would. The word "biće" should have "neuter" in its senses' tags, but instead you need to get the gender through the head_templates' expansion string.

Both issues can also be solved if I create an issue on wiktextract to let the developer know there are some problems with the dump.

Hopefully the wiktextract developer can issue a fix for those issues. But, it sounds like you have worked around them for now.

The deinflection part is handled in yomichan/js/language/translator.js.

It goes over each result, and looks to see if the result has a special tag named non-lemma. If it has the tag, it looks through each definition. A non-lemma definition will be something like verb {sue -> savoir} feminine singular past participle of savoir (->savoir). Using RegEx, I take out the lemma (savoir), and add that to an object called requiredSearches. It then runs a loop which continuously looks up stuff until there's no more results with the non-lemma tag. This is because sometimes a non-lemma form will point to another non-lemma form. I noticed this happening a lot in Russian (just think of a genitive case of a diminutive form of a lemma), and I finally fixed it once I made a dictionary for Swedish (ex: vises -> vise -> vis).
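The loop described above could be sketched roughly as follows. This is not the code from translator.js: the Map-based dictionary, the resolveLemmas name, the placeholder definition strings, and the use of an array for requiredSearches (the description says it is an object) are all assumptions made for illustration.

```javascript
// Matches the trailing "(->lemma)" pointer in a non-lemma definition.
const LEMMA_POINTER = /\(->([^)]+)\)\s*$/;

// Follows non-lemma pointers until only lemma entries remain.
// `dictionary` is a hypothetical Map from term to an array of entries.
function resolveLemmas(term, dictionary) {
  const seen = new Set([term]); // guards against pointer cycles
  const lemmas = [];
  let requiredSearches = [term];

  while (requiredSearches.length > 0) {
    const next = [];
    for (const word of requiredSearches) {
      for (const entry of dictionary.get(word) ?? []) {
        if (!entry.tags.includes("non-lemma")) {
          if (!lemmas.includes(word)) lemmas.push(word);
          continue;
        }
        // Non-lemma entry: queue every lemma its definitions point to.
        for (const definition of entry.definitions) {
          const match = LEMMA_POINTER.exec(definition);
          if (match && !seen.has(match[1])) {
            seen.add(match[1]);
            next.push(match[1]);
          }
        }
      }
    }
    requiredSearches = next;
  }
  return lemmas;
}

// Toy data mirroring the Swedish chain vises -> vise -> vis.
// The definition text is a placeholder, not real dictionary content.
const toyDictionary = new Map([
  ["vises", [{ tags: ["non-lemma"], definitions: ["{vises -> vise} form of vise (->vise)"] }]],
  ["vise", [{ tags: ["non-lemma"], definitions: ["{vise -> vis} form of vis (->vis)"] }]],
  ["vis", [{ tags: [], definitions: ["placeholder lemma definition"] }]],
]);

console.log(resolveLemmas("vises", toyDictionary)); // logs [ 'vis' ]
```

The seen set is not mentioned in the description; it is added here so that two forms pointing at each other cannot cause an endless loop.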

This is an interesting problem. I am pretty sure the same multiple step lemmatization would happen in Croatian. Maybe later you can explain your process for deriving the lemma from the Kaikki rip. I will be using the dictionary outside of Yomichan so full lemmatizing would need to be included in the final dictionary ( {inflection -> lemma -> lemma}? ).

I understand why your tool-set and methodology are different for each language. But, have you considered including the setup process and tool-set code for each language in the repository as you complete them? It could give people the chance to adapt your process and perhaps make improvements or process their own language.

Thanks again! I am excited to try out the Croatian dictionary. It will make using jiroujisho super useful with captioned videos.

anewstheart commented 1 year ago

@seth-js

Something I forgot to mention is that I created an Anki deck for the most frequent 2,500 Croatian words a few years ago.

https://github.com/anewstheart/croatian-word-frequency-list

As part of the process I merged the frequency lists from Opensubtitles and from the Croatian National Corpus.

https://clarin.si/noske/run.cgi/corp_info?corpname=engri&struct_attr_stats=1&subcorpora=1

The Opensubtitles corpus was fairly different when compared to the national corpus. The main reason for this in my reckoning was that the subtitles are usually translations of English to Croatian rather than original Croatian content. So, a word may appear often as a part of a direct English translation but a different word or phrase would be used for original Croatian content.

Secondarily, the Opensubtitles corpus is exclusively spoken word content whereas the National Corpus is mainly written word. The written word was more formal and complex compared to the spoken word, i.e., the difference in content between a newspaper article and a conversation between characters.

In my own amateur way, I believe I created the most accurate frequency list that exists for Croatian up to 2,500 words. I didn't go further because manually cleaning junk data out of the list was very time-consuming for someone who is only a learner of the language.

Maybe you will find this info useful for Croatian frequency or at least interesting as a thought on using Opensubtitles as a frequency source.

seth-js commented 1 year ago

@anewstheart

It's insanely untested, but let me know how it works for you. https://github.com/seth-js/yomichan-sr-hr

seth-js commented 1 year ago

@anewstheart

Maybe later you can explain your process for deriving the lemma from the Kaikki rip

Each lemma found in Kaikki's rips also has a forms key that contains all possible inflections. I create an object called formDict which points a form to all possible lemmas.
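A rough sketch of building such a mapping, assuming each parsed entry has a word key and a forms array whose items have a form key. The buildFormDict name, the use of Map and Set, and the trimmed-down sample entries are assumptions for illustration, not the actual build script:

```javascript
// Builds a lookup from each inflected form to every lemma that lists it.
function buildFormDict(entries) {
  const formDict = new Map();
  for (const entry of entries) {
    for (const { form } of entry.forms ?? []) {
      // Skip empty forms and forms identical to the lemma itself.
      if (!form || form === entry.word) continue;
      if (!formDict.has(form)) formDict.set(form, new Set());
      formDict.get(form).add(entry.word);
    }
  }
  return formDict;
}

// Toy entries: French "sue" is a form of both "savoir" and "suer".
const toyEntries = [
  { word: "savoir", forms: [{ form: "su" }, { form: "sue" }] },
  { word: "suer", forms: [{ form: "sue" }, { form: "suons" }] },
];

const formDict = buildFormDict(toyEntries);
console.log([...formDict.get("sue")]); // logs [ 'savoir', 'suer' ]
```

Storing a Set per form keeps ambiguous forms pointing at every candidate lemma instead of overwriting earlier ones.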

have you considered including the setup process and tool-set code for each language in the repository as you complete them? It could give people the chance to adapt your process and perhaps make improvements or process their own language.

I agree. I'm still working on stuff, but at some point I may release one of them which can then be repurposed for other languages. I'm just worried about the code being used in proprietary projects, and the possibility that I'd have to help people with multiple language projects since I know the setup the best.

jiroujisho

The dictionary isn't meant to work with the default unmodified Yomichan. There may be some issues unless you're somehow able to load in my custom Yomichan on mobile and then load the dictionary from there.

Maybe you will find this info useful for Croatian frequency or at least interesting as a thought on using Opensubtitles as a frequency source.

If I do create a Croatian frequency list, I'll definitely use the OpenSubtitles corpus. I created a parser that's able to handle multi-word phrases, so I may create a frequency list at some point if you really need one. Words that fall into the 95% coverage category would be marked as "popular" in the dictionary.
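One way to read "95% coverage" is: sort words by count and keep adding them until they account for 95% of all occurrences in the corpus. A small sketch under that assumption follows. The popularWords name and the toy counts are made up for illustration, and the multi-word phrase handling is not shown:

```javascript
// Returns the most frequent words that together reach the given share
// of all occurrences. `counts` is a Map from word to occurrence count.
function popularWords(counts, coverage = 0.95) {
  const sorted = [...counts.entries()].sort((a, b) => b[1] - a[1]);
  const total = sorted.reduce((sum, [, n]) => sum + n, 0);

  const popular = new Set();
  let running = 0;
  for (const [word, n] of sorted) {
    if (running >= coverage * total) break; // target already reached
    popular.add(word);
    running += n;
  }
  return popular;
}

// Toy counts that sum to 100 (not real corpus data).
const toyCounts = new Map([
  ["je", 60],
  ["i", 30],
  ["da", 6],
  ["biće", 3],
  ["nove", 1],
]);

console.log([...popularWords(toyCounts)]); // logs [ 'je', 'i', 'da' ]
```

With these counts, "je" and "i" cover 90% and "da" brings the total to 96%, so the remaining words would not be marked as popular.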

anewstheart commented 1 year ago

I agree. I'm still working on stuff, but at some point I may release one of them which can then be repurposed for other languages. I'm just worried about the code being used in proprietary projects, and the possibility that I'd have to help people with multiple language projects since I know the setup the best.

Perhaps release the code under GPL so that it can't be commercialized so much? I definitely understand about creating work for yourself, but you don't HAVE to help people working with your code if you are busy :) .

The dictionary isn't meant to work with the default unmodified Yomichan. There may be some issues unless you're somehow able to load in my custom Yomichan on mobile and then load the dictionary from there.

The dictionary loads correctly in jiroujisho (at least for my purposes). I will need to strip the -automated- string that you use for creating an inflection pop-up because I don't need that feature.

If I do create a Croatian frequency list, I'll definitely use the OpenSubtitles corpus. I created a parser that's able to handle multi-word phrases, so I may create a frequency list at some point if you really need one. Words that fall into the 95% coverage category would be marked as "popular" in the dictionary.

Neat that you created a parser. My comment was more just some commentary on the quirks of using Opensubtitles as a frequency list vs. other sources.

Thanks again for everything!