mortii / anki-morphs

A MorphMan fork rebuilt from the ground up with a focus on simplicity, performance, and a codebase with minimal technical debt.
https://mortii.github.io/anki-morphs/
Mozilla Public License 2.0

Add option to skip cards that have the same lemma #201

Open RyanMcEntire opened 3 months ago

RyanMcEntire commented 3 months ago

Describe the bug

Marking a word as known does not prevent a new inflection from appearing.

AnkiMorphs treats each inflection of a root word as a new word and sets it as the card's new priority morph. Other inflections of the word are not suspended or marked as known.

Recalcing and changing the settings don't improve this behavior.

Steps to reproduce the behavior

  1. Use a subs2srs library as the morph source
  2. Use ko_core_news_sm with spaCy to generate a frequency list from the corpus, or use collection frequency (see the sketch after this list)
  3. Use the ko_core_news_sm/md/lg morphemizer in the note filters
  4. Choose the setting "am-unknowns field shows morph lemmas"
  5. Check "suspend new cards with only known morphs"
  6. Recalc
  7. Start reviews and mark morphs as known
  8. Watch the same "lemma" come up for review each time it appears as a different inflection in a sentence.
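
For reference, the kind of frequency list in step 2 can be sketched roughly like this. This is a minimal illustration using spaCy's `lemma_` attribute and Python's `Counter`, not AnkiMorphs' actual recalc code; `subs.txt` is a hypothetical corpus file.

```python
# Minimal sketch: build a lemma frequency list from a subtitle corpus with spaCy.
# Not AnkiMorphs' actual recalc code. Assumes the ko_core_news_sm pipeline is
# installed; "subs.txt" is a hypothetical corpus file.
from collections import Counter

import spacy

nlp = spacy.load("ko_core_news_sm")
counts = Counter()

with open("subs.txt", encoding="utf-8") as f:
    for line in f:
        for token in nlp(line.strip()):
            if token.is_alpha:  # skip punctuation and numbers
                counts[token.lemma_] += 1

# With the Korean models these "lemmas" are really '+'-joined morpheme
# breakdowns, which is the problem described under "Additional context" below.
for lemma, count in counts.most_common(20):
    print(lemma, count)
```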

Expected behavior

I expect the morphemizer to distill a word to something like a lemma (spaCy isn't capable of doing this properly with its Korean models, but that may or may not be a separate issue). I'd expect AnkiMorphs to show me new words and bury variations of the same word. Just as if I were learning English, I don't need a card for walk, walking, walked, will walk, might walk, want to walk, and so on, for every single word.

Currently, it treats each inflection as a new word, so it behaves no differently than if the text were simply split on spaces.

My setup

Additional context

spaCy has three Korean models: ko_core_news_sm, ko_core_news_md, and ko_core_news_lg. They all behave the same way in this respect.

The website states that it lemmatizes Korean, and this isn't technically true. The lemma_ value returned by spaCy looks like this, with the raw word on the left and the "lemma" value on the right:

('준비했죠', '준비+하+었+죠')
('위해서', '위하+어서')
('먹을', '먹+ㄹ')

The lemma isn't a lemma at all, but rather a breakdown of each word part, and the left-most part is only the stem, which isn't the dictionary form of the word. The verb for "to eat" is 먹다, not 먹; 먹 is a rare noun for an ink stick used to make writing ink.

A proper lemma value for these would look like this, placing them in their dictionary form:

('준비했죠', '준비하다')
('위해서', '위하다')
('먹을', '먹다')

To explain with a single word, this is what spaCy produces:

('먹다', '먹+다')
('먹었어', '먹+었+어')
('먹는데', '먹+는+데')

Lemmatized properly, it would look like this, where all of these are inflections of the same word:

('먹다', '먹다')
('먹었어', '먹다')
('먹는데', '먹다')
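
To reproduce the spaCy side of this comparison, here is a minimal sketch (assuming the ko_core_news_sm pipeline and its Korean dependencies are installed):

```python
# Minimal sketch: print (surface form, lemma_) pairs from spaCy's Korean model.
# Assumes the ko_core_news_sm pipeline and its Korean dependencies are installed.
import spacy

nlp = spacy.load("ko_core_news_sm")

for word in ["먹다", "먹었어", "먹는데"]:
    for token in nlp(word):
        # lemma_ comes back as a '+'-joined morpheme breakdown,
        # not the dictionary form 먹다.
        print((token.text, token.lemma_))
```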

As you can see from the frequency list generated by AnkiMorphs, a word like 괜찮다 takes up 1034 slots on the frequency list. The value of using a morphemizer other than splitting on spaces is essentially lost entirely.

mortii commented 3 months ago

@RyanMcEntire thanks for the detailed feedback!

First off, I just want to say that Korean is super weird, so it's tricky to make it work within the same general framework as other languages.

I expect the morphemizer to distill a word to something like a lemma (spaCy isn't capable of doing this properly with its Korean models, but that may or may not be a separate issue).

This already hits the nail on the head: if spaCy is used as the morphemizer then the lemmas aren't going to be "correct". I basically had to hack the lemmatizer to make it produce something that conforms to other languages: https://github.com/mortii/anki-morphs/blob/2d47bac7628d913afb335db01d19493f70d03355/ankimorphs/spacy_wrapper.py#L74-L84

To get proper lemmas we would have to use a different morphemizer entirely, which may or may not exist; I don't actually know.

I'd expect AnkiMorphs to show me new words and bury variations of the same word. Just as if I were learning English, I don't need a card for walk, walking, walked, will walk, might walk, want to walk, and so on, for every single word.

That's understandable; however, this is a feature, not a bug. If you ignore all inflections after learning a lemma, then you might get very confused when encountering "flew" and "flown" after learning "fly". That being said, a lot of people might prefer that approach, which is why we are adding a "treat all lemmas as known" option to the new algorithm. I don't know if we should add an equivalent "skip" option before that's implemented, since it would only bury those cards for one day. We should definitely add that option at some point, though.
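
To make the distinction concrete, here is a conceptual sketch of how a "treat all lemmas as known" option changes the unknown-morph check. All names here are hypothetical; this is not AnkiMorphs' actual code.

```python
# Conceptual sketch only, not AnkiMorphs' implementation. Each morph is a
# (lemma, inflection) pair; the option decides which set the check consults.
def card_has_unknowns(
    card_morphs: list[tuple[str, str]],
    known_lemmas: set[str],
    known_inflections: set[str],
    lemmas_count_as_known: bool,
) -> bool:
    for lemma, inflection in card_morphs:
        if lemmas_count_as_known:
            if lemma not in known_lemmas:
                return True  # a genuinely new lemma -> card still teaches something
        elif inflection not in known_inflections:
            return True  # a new inflection counts as unknown even if its lemma is known
    return False
```

With the option on, a card whose morphs are all inflections of already-known lemmas would count as known (and could be suspended or skipped) instead of being queued again.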

RyanMcEntire commented 3 months ago

@mortii thanks for the response. That clears a lot of things up.

I can see how it would be useful if the variations were just "fly, flown, flying, flew", so that makes a lot of sense. Unfortunately, with the way Korean works, it's not just 2 or 3 variations; it's more like 500-1500 or more per word in the case of commonly used verbs. My stats look like U: 5950 A: 7409, and AnkiMorphs is still constantly feeding me variations of 있다 (to exist) and 하다 (to do), the most common lemma in the language.

I can also see how this might be okay for someone new to the language, so they don't get confused, and it might work better for someone who doesn't have as many cards as I have here.

If an option to mark all inflections as known is being included, that would be an enormous help. Until then, I'm not sure the tool is usable for my situation.

RyanMcEntire commented 3 months ago

@mortii I should also mention that I was messing around with https://pypi.org/project/soylemma/ which actually does a pretty decent job of deconjugating Korean verbs. It has its own fallbacks, but it's much closer to producing a proper lemma in plain dictionary form.

It builds on KoNLPy.
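
For anyone curious, here is roughly how it can be used; a minimal sketch based on soylemma's documented interface (a Lemmatizer class with a lemmatize() method), so treat the exact output as illustrative rather than guaranteed.

```python
# Minimal sketch based on soylemma's documented interface; the candidates
# and POS tags mentioned in the comment are illustrative and may differ.
from soylemma import Lemmatizer

lemmatizer = Lemmatizer()

for word in ["준비했죠", "먹었어", "괜찮았어"]:
    # lemmatize() returns candidate (dictionary form, POS) pairs,
    # e.g. something like [('먹다', 'Verb')] for '먹었어'.
    print(word, lemmatizer.lemmatize(word))
```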

Maybe it's off topic in this particular thread; let me know if you'd like me to open a different one.

mortii commented 3 months ago

Unfortunately, with the way Korean works, it's not just 2 or 3 variations; it's more like 500-1500 or more per word in the case of commonly used verbs. [...] Until then, I'm not sure the tool is usable for my situation.

That is completely fair; with the current state of the addon, it might just be useless for learning Korean :(

I was messing around with https://pypi.org/project/soylemma/ which actually does a pretty decent job of deconjugating Korean verbs.

That is super interesting! I suspect almost anything would be better than the spaCy lemmatizer, so including something like this might be a huge improvement. You can create a discussion thread about it if you want :)