seth-js / yomichan-es

A Spanish hover dictionary. It's a modified version of Yomichan that works with Spanish (Castilian).

Are there any plans to create a "Yomichan-en" for English students? #6

Closed (avc1657 closed this issue 7 months ago)

seth-js commented 11 months ago

Sorry for the late reply.

You mean like a monolingual English Wiktionary dictionary?

MaxDailene commented 11 months ago

> Sorry for the late reply.
>
> You mean like a monolingual English Wiktionary dictionary?

That would be great for advanced or even intermediate learners of English. People who know enough to take advantage of comprehensible input and just look up words on the fly.

avc1657 commented 11 months ago

> Sorry for the late reply.
>
> You mean like a monolingual English Wiktionary dictionary?

Hi, I mean a Yomichan fork that can parse English, like this repo does for Spanish. I already have some English dictionaries in Yomichan format, but since Yomichan can't parse English, those dicts are not really that usable at the end of the day. If you have an English dict but the app can't deconjugate English words and so on, it's not usable.

avc1657 commented 11 months ago

It would be great to have a yomichan-en because a lot of people study English.

seth-js commented 11 months ago

@MaxDailene @avc1657 I made a version of yomichan-en last night that has some issues. I'll start testing it and fixing bugs, and hopefully I'll have something ready to go soon.

@MaxDailene Also, sorry about the reply I made to you earlier. I deleted the post a couple minutes after making it because I'd realized how much work I'd given myself, especially considering that a lot of the code for unreleased languages had never been tested. I plan to release the code used to generate yomichan-en within the repository itself. This code can be re-used for other languages with relatively little editing, depending on the language.

seth-js commented 10 months ago

Still working on things. There are a few bugs I've fixed in Yomichan itself, mostly in how entries and the grammar box are displayed. I also made words ending in 's get parsed properly.

I'm trying to figure out how to handle nested definitions for words like "the".

I was also able to rip an English learner's dictionary and make it into a Yomichan-compatible dictionary, but I probably can't share it here.

seth-js commented 10 months ago

I'm giving up on using Wiktionary's definitions for yomichan-en. The way wiktextract provides nested definitions is breaking everything and is a complete nightmare to parse. Words like "save", "bag", and "the" are having all sorts of issues.

Another problem I've seen is that a lot of definitions aren't really great for English learners.

Compare this:

> Any of the flying mammals of the order Chiroptera, usually small and nocturnal, insectivorous or frugivorous.

with this:

> an animal like a mouse with wings that flies and feeds at night (= it is nocturnal). There are many types of bat.

Fortunately, the wiktextract data is still extremely useful for handling non-lemmas.

The other dictionary I made is looking great now.

[Image: tidy bat]

avc1657 commented 10 months ago

I have some English dictionaries here in other formats like Migaku and DSL. I also have an English frequency list (not in Yomichan format, though) and an English deconjugation table (this might give some insights since you're having problems with the parsing, idk). I could share those on Discord if you have one.

seth-js commented 10 months ago

I'm good right now, but thanks for the offer.

> i have an english deconjugation table (this might give some insights since you're having problems with the parsing, idk).

> The way wiktextract provides nested definitions is breaking everything and is a complete nightmare to parse.

I shouldn't have used the word "parse". I meant that pulling nested definitions from wiktextract's English Wiktionary rip is a huge problem because it provides them in pieces, so you have to detect when a nested definition begins and when it ends. Even once you have that figured out, turning it into a bunch of nested ol lists can break how Yomichan counts definitions, so multiple definitions can end up starting with the wrong number.
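
To make the problem concrete, here is a minimal sketch of regrouping flat senses into a tree and numbering them explicitly. It assumes each sense carries its full gloss path in a `glosses` list (parent gloss first, own gloss last); that shape, the sample glosses, and the names `build_gloss_tree` and `render` are illustrative assumptions, not the code used for yomichan-en.

```python
from typing import List


def build_gloss_tree(senses: List[dict]) -> List[dict]:
    """Group flat senses into a tree, keyed by each sense's gloss path."""
    roots: List[dict] = []
    for sense in senses:
        level = roots
        for gloss in sense.get("glosses", []):
            # Reuse the node if an earlier sense already created this parent.
            node = next((n for n in level if n["gloss"] == gloss), None)
            if node is None:
                node = {"gloss": gloss, "children": []}
                level.append(node)
            level = node["children"]
    return roots


def render(nodes: List[dict], prefix: str = "") -> List[str]:
    """Number every definition explicitly instead of relying on list counters."""
    lines: List[str] = []
    for index, node in enumerate(nodes, start=1):
        number = f"{prefix}{index}"
        lines.append(f"{number}. {node['gloss']}")
        lines.extend(render(node["children"], prefix=f"{number}."))
    return lines


# Made-up sample data, only to show the shape assumed above.
SAMPLE_SENSES = [
    {"glosses": ["To store.", "To write a file to disk."]},
    {"glosses": ["To store.", "To put money aside."]},
    {"glosses": ["To rescue."]},
]

if __name__ == "__main__":
    print("\n".join(render(build_gloss_tree(SAMPLE_SENSES))))
```

Writing the numbers into the text is one way to sidestep a renderer that counts nested lists differently; whether that fits Yomichan's structured content is a separate question.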

The actual parsing (handling non-lemma -> lemma, multi-word phrases) is working nicely.
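
As a rough illustration of what "non-lemma -> lemma" and multi-word phrase handling can look like, the sketch below uses a small lookup table plus a longest-match scan. The tables `FORMS` and `PHRASES`, the function names, and the 's stripping rule are all hypothetical; a real table would be built from the wiktextract form data, and 's can also be a contraction of "is" or "has".

```python
from typing import List

# Hypothetical data; a real table would come from wiktextract's form info.
FORMS = {"saved": "save", "bats": "bat", "gave": "give"}
PHRASES = {("give", "up"), ("bat", "an", "eye")}


def lemmatize(token: str) -> str:
    """Map a surface form to its lemma, stripping a trailing 's first."""
    word = token.lower()
    for suffix in ("'s", "\u2019s"):
        if word.endswith(suffix) and len(word) > len(suffix):
            word = word[: -len(suffix)]
            break
    return FORMS.get(word, word)


def lookup(tokens: List[str], start: int, max_len: int = 4) -> str:
    """Return the longest known phrase starting at `start`, else the lemma."""
    lemmas = [lemmatize(t) for t in tokens[start : start + max_len]]
    for size in range(len(lemmas), 1, -1):
        if tuple(lemmas[:size]) in PHRASES:
            return " ".join(lemmas[:size])
    return lemmas[0] if lemmas else ""


if __name__ == "__main__":
    print(lookup(["She", "gave", "up", "early"], 1))
    print(lookup(["the", "bat's", "wings"], 1))
```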

> i have an english frequency list (not in yomichan format tho)

I was wondering earlier about this too. For other dictionaries I've worked on, I ran a parser I wrote through an OpenSubtitles corpus, and made the frequency list from that. Since English has so many frequency lists available, especially for English learners, I just decided to go with the Oxford 3000 and Oxford 5000 lists. From that page, I was able to scrape the lists, including the CEFR level and audio URLs for each word. I modified yomichan-forvo-server to look first for the Oxford URLs that I provide, and otherwise to fall back to Forvo for English audio. There's some minor setup involved to get that working, but it's worth it because the top 5000 English words have professional voice actors for the audio.
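
The "curated URL first, Forvo otherwise" lookup can be sketched roughly as below. This is not yomichan-forvo-server's code: `audio_sources`, the `curated` mapping, and the `fallback` callable are hypothetical names, and the URL is a placeholder.

```python
from typing import Callable, Dict, List


def audio_sources(
    word: str,
    curated: Dict[str, str],
    fallback: Callable[[str], List[str]],
) -> List[str]:
    """Prefer a curated audio URL; ask the fallback source only if none exists."""
    url = curated.get(word.lower())
    if url:
        return [url]
    return fallback(word)


if __name__ == "__main__":
    # Placeholder data; real entries would be the scraped word -> URL pairs.
    curated = {"student": "https://example.invalid/audio/student.mp3"}
    stub_fallback = lambda w: [f"fallback:{w}"]  # stands in for a Forvo query
    print(audio_sources("Student", curated, stub_fallback))
    print(audio_sources("aesthetic", curated, stub_fallback))
```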

[Image: student aesthetic]

Everything so far has been mostly geared towards learners of American English, but I will also provide a version that shows British English IPA, and favors British Oxford/Forvo audio.

I want to test everything out some more, but hopefully I can get this released within a week or so.

avc1657 commented 10 months ago

The frequency list I have here is 100k words; I don't know its origin, though. And the format is just like:

[ "you", "i", "the", "to", "a", "it", "and", "that", "of", "in", "what", "is" . . . ]
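
For reference, a list ordered like this can be turned into rank entries with a few lines of code. The sketch below assumes the list is sorted most frequent first and emits `[term, "freq", rank]` rows, which is the shape Yomichan term meta banks are understood to use; check that against the dictionary schema before relying on it. The function name is hypothetical.

```python
import json
from typing import List


def to_frequency_entries(ranked_words: List[str]) -> List[list]:
    """Turn a most-frequent-first word list into [term, "freq", rank] rows."""
    seen = set()
    entries: List[list] = []
    for word in ranked_words:
        if word in seen:
            continue  # keep the first (best) rank for repeated words
        seen.add(word)
        entries.append([word, "freq", len(entries) + 1])
    return entries


if __name__ == "__main__":
    sample = ["you", "i", "the", "to", "a", "it"]
    print(json.dumps(to_frequency_entries(sample)))
```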

seth-js commented 10 months ago

> The frequency list i have here is 100k words, dont know their origin, tho.

I don't need that, but again, thanks for the offer.

seth-js commented 10 months ago

I tried one more time with Wiktionary, using a different approach to structure nested definitions, and it looks like everything is finally displayed properly. I'll probably have yomichan-en release with this Wiktionary dictionary, and I'll release a dictionary specifically for learners on Refold's English Discord.

[Image: wiktionary-the]

seth-js commented 10 months ago

I'm no longer going to work on yomichan-en. I hate to say it, but it won't be released.

The more I use yomichan-en, the more I feel like it's not appropriate for English learners. The Wiktionary dictionary is full of unnecessary bloat, vague and complicated definitions, and rare definitions based on extremely obscure slang or dialectal usage I've never even heard of. Also, many forms for English terms are broken right now.

Although I appreciate everything TheMoeWay does, another issue is that yomitan still isn't ready for everyday use, 8 months after Yomichan was abandoned.

I've been testing goldendict-ng, and the experience is 100x better for English learners. The layout is much cleaner, and you can import multiple dictionaries, including bilingual ones. It can also watch your clipboard similar to Yomichan.

I also found out about gd-tools, and I realized that I can create my own little hover dictionary inside of GoldenDict. I started working on it a couple days ago, and I'm already able to handle multi-word phrases. I'm also going to display an inflection box similar to the ones I've made for my other Yomichan-related projects.

[Image: gd]

Sorry if I've wasted any of your time.

seth-js commented 10 months ago

@MaxDailene I released yomichan-sr-hr's code for creating the dictionary. Let me know what language you're working on if you need code for a more similar language.

avc1657 commented 10 months ago

So you suggest goldendict-ng to study English? Where can I find dictionaries for it?

seth-js commented 10 months ago

Do this, and let me know if you need help:

  1. Download Oxford Advanced Learner's Dictionary 1 0a (March 2023) (MDX).
  2. Extract it somewhere, such as C:\Users\[Username]\Documents\GoldenDict Dictionaries\, preferably in its own folder.
  3. Install the latest pre-release of goldendict-ng.
  4. Download this collection of hunspell dictionaries.
  5. Extract it, and copy dictionaries/en/index.aff and dictionaries/en/index.dic to another folder (I went with C:\Users\[Username]\Documents\Hunspell Dictionaries\). Make sure you only copy index.aff and index.dic and that they are in the Hunspell Dictionaries folder.
  6. Open GoldenDict.
  7. Go to Edit > Dictionaries.
  8. Add the folder you extracted the dictionary to (I used C:\Users\[Username]\Documents\GoldenDict Dictionaries\) and enable recursive search.
  9. Go to the Morphology tab, change the path to that hunspell directory from earlier (C:\Users\[Username]\Documents\Hunspell Dictionaries\), and enable English Morphology.
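
Step 5 can also be scripted. The sketch below copies only `index.aff` and `index.dic` from the extracted collection's `dictionaries/en` folder into a target folder; the function name and both path arguments are hypothetical, so adjust them to wherever you extracted the files.

```python
import shutil
from pathlib import Path
from typing import List


def copy_hunspell_files(extracted_root: Path, target_dir: Path) -> List[Path]:
    """Copy only index.aff and index.dic for English into target_dir."""
    source = extracted_root / "dictionaries" / "en"
    target_dir.mkdir(parents=True, exist_ok=True)
    copied: List[Path] = []
    for name in ("index.aff", "index.dic"):
        src = source / name
        if not src.is_file():
            raise FileNotFoundError(f"missing hunspell file: {src}")
        copied.append(Path(shutil.copy2(src, target_dir / name)))
    return copied


if __name__ == "__main__":
    # Example paths only; replace them with your own locations.
    docs = Path.home() / "Documents"
    try:
        for path in copy_hunspell_files(docs / "dictionaries-extracted",
                                        docs / "Hunspell Dictionaries"):
            print("copied", path)
    except FileNotFoundError as error:
        print(error)
```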

I'm still working on the English hover dictionary idea. It's 100% possible, and I was able to make one for Japanese yesterday. I'm just going to wait until that wiktextract issue gets fixed.

seth-js commented 10 months ago

@MaxDailene The source code has undergone major updates. You'll want to use the new code as a base instead.

RyanOrigens commented 10 months ago

The problem with GoldenDict is that it's extremely bloated, and because of that it has a huge learning curve.

Also, there are too many unnecessary features. I would recommend that you stick to Yomichan and try to fix it, or try to fork "Crow Translate" and build on top of it.

[Image: Crow Translate]

It has a pop-up dict as well as a normal one, and it would be able to grab the data automatically from any translation service out there. It just doesn't have Forvo support, parsing or Anki integration.

seth-js commented 10 months ago

> extremely bloated

Yomichan takes about 20-30 minutes to import a dictionary that takes about 20 seconds to import in GoldenDict-ng. Even after you remove a dictionary in Yomichan, leftover database files may remain. I had to find and manually delete around 5GB of dictionary data that was hiding in an unused Yomichan extension folder in Firefox's appdata.

> huge learning curve

It took about 20 minutes for me to figure out how everything works. It also helps that the project is very active, and has developers who were able to answer questions I asked on their GitHub issues page within a few hours or even minutes.

> too many unnecessary features

I only use the features I need. Yomichan was made for Japanese, so large parts of the Yomichan code are unnecessary and unused in my forks.

> it just doesn't have Forvo support, parsing or anki integration.

I've been able to get Forvo, parsing for multiple languages, inflection info, and Anki working with GoldenDict-ng within the last couple weeks. It allows me to do things that weren't possible in Yomichan like adding accents automatically for Russian words, and marking French liaisons automatically. It's insanely useful that I have complete control over the entire process now.

I'll stick with GoldenDict-ng. It's the best option. Considering that tatsumoto-ren is making tools for it, updating mpvacious to work with it, and that it's what's recommended for the Jitendex project, I think other people are starting to see that GoldenDict-ng is amazingly useful software.

avc1657 commented 10 months ago

Is there a way to improve the looks of the definitions inside Anki? Inside GoldenDict the definitions look like this https://i.imgur.com/ofBbUkX.png but inside Anki they look like this mess https://i.imgur.com/fn0w8kg.png

avc1657 commented 10 months ago

Apparently that's just how goldendict-ng is as of now. Not really usable if Anki cards look very ugly.

seth-js commented 9 months ago

Yomitan is nearing its first stable release. I plan on updating my Yomichan projects over to it sometime soon. Some minor bug fixes for Yomichan will also be included.

I've been using my GoldenDict setup for a bit now. Although it's nice to have my own parser, and importing new changes to the dictionary is extremely fast, I've decided to continue development on my Yomichan/Yomitan projects instead. I don't usually use Anki, but clearly, as avc1657 mentioned, Anki card creation is not well supported in GoldenDict unless you select only one definition and not the whole entry. Another annoyance I had was the need to run a server for parsing, whereas Yomichan/Yomitan's approach to parsing is much simpler and easier to work with.

I'm also planning on releasing yomitan-en. It will use OALD definitions over Wiktionary, but Wiktionary's form information will still be used for lemmatization.

The frequency lists for my Yomichan projects will also be updated to be a bit more accurate by having spaCy help during the parsing process.

seth-js commented 7 months ago

You guys can head over to the Yezichak project. It already has a working yomitan-en setup. I will also be releasing the dictionary I've been working on in his Discord server sometime soon.