mortii / anki-morphs

A MorphMan fork rebuilt from the ground up with a focus on simplicity, performance, and a codebase with minimal technical debt.
https://mortii.github.io/anki-morphs/
Mozilla Public License 2.0

Update what is considered the card's unknown morph when changing the text in the am-unknowns field #152

Closed quietmansoath closed 5 months ago

quietmansoath commented 5 months ago

Is your feature request related to a problem? Please describe.

I'm sorry for creating so many issues (lol). I just really want to move away from MorphMan. Anyway, this is something that MorphMan does not seem to do either.

AnkiMorphs does not always properly identify the unknown word. For example, it determines the unknown word is 住み, when actually it's 用済み. If I simply delete 住み and add 用済み in its place, the card still acts as if 住み is the unknown word, and even continues to highlight it in the text when using am-highlighted.

This is frustrating because I do not know either word, and I would like to have separate cards for each. However, I may only have one card that has both of these words in it. This is also frustrating because sometimes I might prefer one word over the other in tricky cases like AM finding とっ, but not とっ捕まえる, the latter of which I want. This only gets worse if the unknown word is actually a phrase that I want considered as a whole instead.

Describe the solution you'd like

I would like for any text that I add to the field am-unknowns to be reflected back in the database of known and unknown words so that if I update the text within the field, that new text is now seen as the new unknown word.

Describe alternatives you've considered

In the past I just ignored this behavior and overwrote the text in the field anyway, but the more cards I study, the more I realize I'm finding unknown words based on what (then) MorphMan and (now) AnkiMorphs think I know. It gets complicated when I don't have alternative examples of an unknown word or phrase that would allow me to just learn both the card with the original unknown word and a card with the other unknown word I had to manually change.

I realize this might not be fixable, but I still figured I would try to say something. Hopefully not being a bother!

Additional context

Here are screenshots. "Morph" field is what morphman's parser found, for reference.

[Screenshot 2024-02-07 102905] [Screenshot 2024-02-07 103211]

This one I was also unsure about, since I didn't know 出番 or どうやら, so ultimately I had to choose one and find a hacky workaround to study 出番.

[Screenshot 2024-02-07 103039]

This last image is an example of the problem using am-highlighted.

[Screenshot 2024-02-07 104024]

mortii commented 5 months ago

I would like for any text that I add to the field am-unknowns to be reflected back in the database of known and unknown words so that if I update the text within the field, that new text is now seen as the new unknown word.

Sorry, but implementing that would be an absolute nightmare, and it would likely give very unexpected and undesirable behavior, so I'll have to pass on that.

AnkiMorphs does not always properly identify the unknown word. For example, it determines the unknown word is 住み, when actually it's 用済み.

No parsers/morphemizers are perfect, especially Japanese ones. Have you tried the spaCy Japanese models? They're trickier to set up, but they might give you better results, depending on your preferences. They tend to split morphs more aggressively, which can be nice if you prefer a more grammar-focused approach rather than learning through volume. Using the morphemizer spaCy: ja_core_news_sm actually recognizes 用済み:

[Screenshot from 2024-02-08 12-58-53]

but overall I find that the AnkiMorphs: Japanese morphemizer (non-unidic-mecab) tends to handle this better.
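If you want to poke at what spaCy produces outside of Anki, here is a minimal sketch. It assumes spaCy and the ja_core_news_sm model are installed (pip install spacy, then python -m spacy download ja_core_news_sm); the example sentence is just an illustration, and the exact splits and lemmas can vary between model versions:

```python
# Minimal sketch: inspect how a spaCy Japanese model splits a sentence.
# Assumes spaCy and ja_core_news_sm are installed; output can vary by model version.
import spacy

nlp = spacy.load("ja_core_news_sm")
doc = nlp("お前はもう用済みだ")  # made-up example sentence containing 用済み

for token in doc:
    print(token.text, token.lemma_, token.pos_)
```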

Here are screenshots. "Morph" field is what morphman's parser found, for reference.

Those discrepancies are pretty weird. Are you using the standard Japanese morphemizer for MorphMan, or did you download something extra?

I realize this might not be fixable, but I still figured I would try to say something. Hopefully not being a bother!

No, it's great! The more feedback the better!

mortii commented 5 months ago

Also, based on these pictures alone:

[image] [image]

I can't tell if the morphs are not found, or if AM determined that you already know them. If you upload your deck I can find out pretty quickly which it is.

Edit: You can also check it by finding the card in the browser -> right click it -> view morphemes.

quietmansoath commented 5 months ago

Those discrepancies are pretty weird. Are you using the standard Japanese morphemizer for MorphMan, or did you download something extra?

Ohhh, I just realized I still have MeCab enabled because I was using it with MorphMan. I wonder if that's part of the issue?

I checked the top card, and it shows this: [image]

I do actually have 捕まえる as a card I marked as already known, but I would have thought AM would have still parsed とっ捕まえる as a separate unknown word.

For the second one, it looks like I screwed up and accidentally marked a card as known that I shouldn't have.

Sorry, but implementing that would be an absolute nightmare, and it would likely give very unexpected and undesirable behavior, so I'll have to pass on that.

That's okay, I understand. Thank you for considering it!

No parsers/morphemizers are perfect, especially Japanese ones. Have you tried the spaCy Japanese models? They're trickier to set up, but they might give you better results, depending on your preferences. They tend to split morphs more aggressively, which can be nice if you prefer a more grammar-focused approach rather than learning through volume.

I have not yet but I might try it. What I'm worried about is fixing one thing versus messing up another. If it splits things more aggressively, I might miss out on phrases like 気の毒 or 弱肉強食 but I guess I'll have to play with it and see.

I guess I'm just unsure what to do in those situations where the parser does mess up and I want to learn both the "parts" of the word and the sum total if those parts make up a new phrase.

mortii commented 5 months ago

Ohhh, I just realized I still have MeCab enabled because I was using it with MorphMan. I wonder if that's part of the issue?

MorphMan silently loads mecab from one of these sources (in order):

  1. "MecabUnidic from addon MecabUnidic"
  2. "MecabUnidic from addon 13462835"
  3. "Japanese Support from addon 3918629684"
  4. "MIAJapaneseSupport from addon MIAJapaneseSupport"
  5. "Migaku Japanese support from addon 278530045"
  6. "From MorphMan"

These different versions of mecab can have significantly different results. AnkiMorphs only uses the equivalent of the 6th option; it does not load mecab from other sources. So yes, it's most likely the cause of the discrepancy.
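Just to illustrate the idea (this is not MorphMan's actual code, and the module layout here is hypothetical): the behavior described above amounts to a priority list where the first importable mecab source wins, so which mecab you end up with depends on which add-ons happen to be installed:

```python
# Illustrative sketch only; not MorphMan's real loader. The add-on names/IDs come
# from the list above, but the module paths and function layout are hypothetical.
from importlib import import_module

MECAB_SOURCES = [
    "MecabUnidic",         # 1. MecabUnidic from addon MecabUnidic
    "13462835",            # 2. MecabUnidic from addon 13462835
    "3918629684",          # 3. Japanese Support from addon 3918629684
    "MIAJapaneseSupport",  # 4. MIAJapaneseSupport from addon MIAJapaneseSupport
    "278530045",           # 5. Migaku Japanese support from addon 278530045
]

def load_mecab():
    """Return the first mecab wrapper that imports; otherwise fall back to the bundled one."""
    for addon in MECAB_SOURCES:
        try:
            return import_module(addon)        # hypothetical: assumes each add-on exposes a wrapper module
        except ImportError:
            continue
    return import_module("morph_bundled_mecab")  # hypothetical stand-in for option 6, "From MorphMan"
```

Since each of these sources ships a different mecab build/dictionary, the same sentence can be parsed differently depending on which branch wins.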

I checked the top card, and it shows this: [image]

I do actually have 捕まえる as a card I marked as already known, but I would have thought AM would have still parsed とっ捕まえる as a separate unknown word.

Yeah, unexpected parsing is frustrating. Trying to fix these issues by changing morphemizer leads to a whack-a-mole scenario with no completely satisfying solution, unfortunately.

Btw, that picture illustrates one of the problems with the #149 approach. Both なきゃ and ねえ will be shown as ない, which can be confusing and unproductive. I'll still add the feature, but yeah, I don't really recommend it.

I have not yet but I might try it. What I'm worried about is fixing one thing versus messing up another. If it splits things more aggressively, I might miss out on phrases like 気の毒 or 弱肉強食 but I guess I'll have to play with it and see.

I guess I'm just unsure what to do in those situations where the parser does mess up and I want to learn both the "parts" of the word and the sum total if those parts make up a new phrase.

Based on that description, I suspect you'll probably be better off using mecab instead of spaCy. The Japanese spaCy models split words like 本当に into [本当, に], which is grammatically correct, but I prefer learning whole phrases instead of breaking them up as much as possible. We have a discussion about that in #115 if you are curious.
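If you're curious, you can check that split yourself with a couple of lines (again assuming the ja_core_news_sm model is installed; newer model versions might tokenize differently):

```python
# Quick sketch: show how the small spaCy Japanese model tokenizes 本当に.
import spacy

nlp = spacy.load("ja_core_news_sm")
print([token.text for token in nlp("本当に")])  # expected per the discussion above: ['本当', 'に']
```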

All these parsing problems should be fixed upstream imo. Allowing cherry picking in AnkiMorphs would open pandora's box, which I don't want to do, sorry :pray:

quietmansoath commented 5 months ago

These different versions of mecab can have significantly different results. AnkiMorphs only uses the equivalent of the 6th option; it does not load mecab from other sources. So yes, it's most likely the cause of the discrepancy.

Is there a good way to "reset" my deck then? That is to say, for me to disable the separate mecab add-on and have AnkiMorphs do a fresh reanalysis of all of my sentences?

Btw, that picture illustrates one of the problems with the https://github.com/mortii/anki-morphs/issues/149 approach. Both なきゃ and ねえ will be shown as ない, which can be confusing and unproductive. I'll still add the feature, but yeah, I don't really recommend it.

That's fair. I do understand the logic of this, but I like to use my deck to study definitions more than grammar. I also have a field on my card for generic "notes" for those times I do want to add more differential details. In any case, I appreciate that you've given users the option now so that everyone can do what they think is best :)

All these parsing problems should be fixed upstream imo. Allowing cherry picking in AnkiMorphs would open pandora's box, which I don't want to do, sorry 🙏

Absolutely understood!

mortii commented 5 months ago

Is there a good way to "reset" my deck then? That is to say, for me to disable the separate mecab add-on and have AnkiMorphs do a fresh reanalysis of all of my sentences?

AnkiMorphs only uses its own bundled mecab, so in a sense, it has always done a fresh analysis; no other add-ons influence how AnkiMorphs' mecab parses the text. That is only a MorphMan "problem".

Sorry if that doesn't make sense, I can try to explain it in a different way if you want me to.

github-actions[bot] commented 4 months ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.