mortii / anki-morphs

A MorphMan fork rebuilt from the ground up with a focus on simplicity, performance, and a codebase with minimal technical debt.
https://mortii.github.io/anki-morphs/
GNU Affero General Public License v3.0

Make a companion add-on for the Chinese jieba morphemizer #193

Closed mortii closed 7 months ago

mortii commented 7 months ago

I've been noticing a bunch of differences between the way jieba and spaCy segment Chinese words. I have the impression that jieba is better, but I need to do a bit more investigation. Hopefully you'll do the MeCab work in a way that makes it easy to add jieba (or another parser) if it looks like it's worth it.

Originally posted by @xofm31 in https://github.com/mortii/anki-morphs/issues/174#issuecomment-1983442509

Making the equivalent companion add-ons for the other morphemizers that come bundled with MorphMan should hopefully be relatively easy.

Originally posted by @mortii in https://github.com/mortii/anki-morphs/issues/174#issuecomment-1983640570

Todo list:

xofm31 commented 7 months ago

I am really excited about this! I can't believe how fast you are able to add new features to AnkiMorphs.

mortii commented 7 months ago

Released in v2.1.0.

It should be identical to MorphMan, but I'm not able to test it extensively because:

  1. I can't use MorphMan
  2. I don't know Chinese
  3. I don't have a lot of Chinese cards

So any feedback would be welcome!

xofm31 commented 7 months ago

I've done some spot checking, and I can confirm that the morphs identified are the same ones that MorphMan finds. Before doing anything, I deleted my ankimorphs.db to ensure that everything it populated was new.

But AnkiMorphs says that I know a lot more morphs than MorphMan does. Trying to track this down, I have a card with the text:

我没想瞒着任何人的

am-highlighted looks like this:

<span morph-status="known">我</span><span morph-status="known">没</span><span morph-status="known">想</span><span morph-status="unknown">瞒</span><span morph-status="known">着</span><span morph-status="known">任何人</span><span morph-status="known">的</span>

The word 任何人 shows up in AnkiMorphs as known, but it is unknown with MorphMan. I can confirm with an Anki search that there are no cards with that morph that are not new.

I then changed back to spaCy. In order to get am-highlighted to update, I had to delete the ankimorphs.db again. Here is what it looks like with spaCy:

<span morph-status="known">我</span><span morph-status="known">没</span><span morph-status="known">想</span><span morph-status="unknown">瞒</span><span morph-status="known">着</span><span morph-status="known">任何</span><span morph-status="known">人</span><span morph-status="known">的</span>

As you can see, spaCy has it separated into two morphs. Both of those individual morphs are on known cards: 遇到任何困难 and just 人. I'm not sure if it is a coincidence, or if there is some remaining pointer to spaCy, or if there is another database that I need to delete?
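For anyone who wants to compare the two outputs mechanically rather than by eye, here is a minimal sketch. It only parses the two am-highlighted strings pasted above with Python's standard library; it does not call jieba, spaCy, or AnkiMorphs. The names `MorphStatusParser` and `morph_statuses` are made up for this illustration.

```python
from html.parser import HTMLParser


class MorphStatusParser(HTMLParser):
    """Collect (morph, status) pairs from <span morph-status="..."> elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._status = None  # status of the span we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._status = dict(attrs).get("morph-status")

    def handle_data(self, data):
        if self._status is not None and data.strip():
            self.pairs.append((data, self._status))

    def handle_endtag(self, tag):
        if tag == "span":
            self._status = None


def morph_statuses(highlighted):
    """Return the list of (morph, status) pairs in an am-highlighted string."""
    parser = MorphStatusParser()
    parser.feed(highlighted)
    parser.close()
    return parser.pairs


# The two field values quoted in the comment above.
jieba_field = (
    '<span morph-status="known">我</span><span morph-status="known">没</span>'
    '<span morph-status="known">想</span><span morph-status="unknown">瞒</span>'
    '<span morph-status="known">着</span><span morph-status="known">任何人</span>'
    '<span morph-status="known">的</span>'
)
spacy_field = (
    '<span morph-status="known">我</span><span morph-status="known">没</span>'
    '<span morph-status="known">想</span><span morph-status="unknown">瞒</span>'
    '<span morph-status="known">着</span><span morph-status="known">任何</span>'
    '<span morph-status="known">人</span><span morph-status="known">的</span>'
)

jieba_morphs = [morph for morph, _ in morph_statuses(jieba_field)]
spacy_morphs = [morph for morph, _ in morph_statuses(spacy_field)]

# Morphs that appear in one segmentation but not in the other.
only_jieba = [m for m in jieba_morphs if m not in spacy_morphs]
only_spacy = [m for m in spacy_morphs if m not in jieba_morphs]

print("only in jieba output:", only_jieba)
print("only in spaCy output:", only_spacy)
```

Both strings cover the same sentence, so the only difference the sketch reports is how 任何人 is split.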

mortii commented 7 months ago

> The word 任何人 shows up in AnkiMorphs as known, but it is unknown with MorphMan. I can confirm with an Anki search that there are no cards with that morph that are not new.

I'm very tired, so parsing multiple negations is hard at the moment, but you are saying that AnkiMorphs: Chinese correctly identifies 任何人 as known, but MorphMan does not?

> I'm not sure if it is a coincidence, or if there is some remaining pointer to spaCy, or if there is another database that I need to delete?

No, only ankimorphs.db. However, speaking from experience, switching between morphemizers can lead to some cards being marked as known with one morphemizer, and the known tag sticks even if you switch to another morphemizer, so you can get a weird pattern where previously unknown cards/morphs become known. Removing all am-known-automatically tags from all cards should be safe, and should hopefully fix that kind of problem.
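As a rough illustration of what that cleanup amounts to, here is a small sketch. It models notes as plain dictionaries and does not use Anki's API; inside Anki you would normally remove the tag from the browser instead. Only the tag name `am-known-automatically` comes from this thread; `strip_tag` and the sample notes are hypothetical.

```python
STALE_TAG = "am-known-automatically"  # tag name as mentioned in this thread


def strip_tag(notes, tag=STALE_TAG):
    """Return copies of the notes with `tag` removed; all other tags are kept."""
    return [
        {**note, "tags": [t for t in note["tags"] if t != tag]}
        for note in notes
    ]


# Hypothetical notes: the first one was tagged while a different morphemizer was active.
notes = [
    {"id": 1, "tags": ["am-known-automatically", "chinese"]},
    {"id": 2, "tags": ["chinese"]},
]

cleaned = strip_tag(notes)
for note in cleaned:
    print(note["id"], note["tags"])
```

After the tags are gone, a recalc decides the known status again with whichever morphemizer is currently selected.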

xofm31 commented 7 months ago

> I'm very tired, so parsing multiple negations is hard at the moment, but you are saying that AnkiMorphs: Chinese correctly identifies 任何人 as known, but MorphMan does not?

No, it should have been unknown but AnkiMorphs said it was known.

> However, speaking from experience, switching between morphemizers can lead to some cards being marked as known with one morphemizer, and the known tag sticks even if you switch to another morphemizer, so you can get a weird pattern where previously unknown cards/morphs become known.

This was the problem. Removing the tags fixed the problem. Thanks!

mortii commented 7 months ago

> No, it should have been unknown but AnkiMorphs said it was known.

If you can send me these files from your anki profile folder:

then I can help debug this if you want.

xofm31 commented 7 months ago

Sorry I wasn't clear. Removing all of the tags & doing a recalc fixed all my issues. Thanks!

github-actions[bot] commented 7 months ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.