mortii / anki-morphs

A MorphMan fork rebuilt from the ground up with a focus on simplicity, performance, and a codebase with minimal technical debt.
https://mortii.github.io/anki-morphs/
GNU Affero General Public License v3.0

Remove Mecab from AnkiMorphs (v2 megathread) #174

Closed · mortii closed this 8 months ago

mortii commented 9 months ago

With that out of the way, your point about the add-on being too big is valid. For me it takes less than a second to download, but that will not be the case for everyone, and a size of 1 KB would of course be much preferable to 21 MB.

ianki already made an add-on that contains only MeCab, which can then be imported into MorphMan and used as a morphemizer; we could do the same thing in AnkiMorphs. It would break backwards compatibility, but it might be worth it in the long run.

Originally posted by @mortii in https://github.com/mortii/anki-morphs/discussions/172#discussioncomment-8661329

This means that we have to make a v2 version. It's very important to handle the backwards compatibility break in a graceful way that doesn't crash Anki or cause other problems that would be very annoying to users.

We should also rename the am-difficulty extra field to am-score while we're at it; that way we don't have to break backwards compatibility twice.
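As a rough illustration of the companion-add-on approach (a hypothetical sketch, not AnkiMorphs' actual code): Anki puts the addons21/ directory on sys.path, so one add-on can import another by its folder name. The module name below is made up.

```python
# Hypothetical sketch of loading a MeCab-only companion add-on.
# Anki add-ons are importable as top-level packages named after their
# folder in addons21/; "mecab_companion" is a made-up stand-in for
# whatever name/ID the real companion add-on would have.
import importlib


def load_mecab_morphemizer():
    try:
        return importlib.import_module("mecab_companion")
    except ModuleNotFoundError:
        # Companion add-on not installed: fall back gracefully
        # instead of crashing Anki, as discussed above.
        return None
```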

Todo list:

xofm31 commented 8 months ago

Out of curiosity, why are you removing mecab?

aleksejrs commented 8 months ago

Out of curiosity, why are you removing mecab?

Because it adds 20 MB to the download when included by default, while the rest of AnkiMorphs is no more than 369 KB.

xofm31 commented 8 months ago

Because it adds 20 MB to the download when included by default, while the rest of AnkiMorphs is no more than 369 KB.

Makes sense. I've been noticing a bunch of differences between the way jieba and spaCy segment Chinese words. I have the impression that jieba is better, but I need to do a bit more investigation. Hopefully you'll do the MeCab work in a way that makes it easy to add jieba (or another parser) if it looks like it's worth it.

mortii commented 8 months ago

Hopefully you'll do the MeCab work in a way that makes it easy to add jieba (or another parser) if it looks like it's worth it.

Yeah, I've been thinking about that too for the last couple of days. Making equivalent companion add-ons for the other morphemizers that come bundled with MorphMan should hopefully be relatively easy.

xofm31 commented 8 months ago

This is probably too much info, but here are the results of my small jieba/spaCy Chinese investigation. I'm sure it's extremely flawed, but I didn't find a scientific comparison. I took all the text from the files I created my Anki cards from and ran it through both jieba and spaCy. My proxy for measuring how good the parsers are is: "how many of the words that the parser outputs are in the Chinese dictionary CC-CEDICT?"

From my text files, jieba found 8850 different words, 6053 of which were in the dictionary; after removing all the words containing numbers, 5733 dictionary words remained. spaCy found 8295 words, quite a few fewer, but 6004 of them were in the dictionary, and after taking out the words with numbers, 5751 remained. Almost the same. Another interesting point: there was only 77% overlap between the words from jieba and spaCy.

So in summary, while jieba and spaCy do give different results, it's not clear if one is actually better than the other for those of us who are using it to find the morphs to learn.
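A minimal sketch of that experiment, assuming jieba and spaCy (with the zh_core_web_sm model) are installed and that cedict_words is a set of CC-CEDICT headwords parsed elsewhere; the overlap metric here is one plausible reading (Jaccard), since the thread doesn't define it:

```python
# Sketch of the coverage comparison described above.
import jieba
import spacy


def coverage(words: set[str], dictionary: set[str]) -> float:
    # Drop words containing digits, as in the experiment above.
    words = {w for w in words if not any(c.isdigit() for c in w)}
    return sum(1 for w in words if w in dictionary) / len(words)


def compare(text: str, cedict_words: set[str]) -> None:
    jieba_words = set(jieba.cut(text))  # Accurate Mode (the default)
    nlp = spacy.load("zh_core_web_sm")
    spacy_words = {t.text for t in nlp(text)}
    print("jieba coverage:", coverage(jieba_words, cedict_words))
    print("spaCy coverage:", coverage(spacy_words, cedict_words))
    jaccard = len(jieba_words & spacy_words) / len(jieba_words | spacy_words)
    print("overlap:", jaccard)
```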

mortii commented 8 months ago

@xofm31 that's really cool, thanks!

The same thing applies to the Japanese morphemizers: they all have their own flaws, and switching from one to another to eliminate them leads to a whack-a-mole scenario.

I think it mostly boils down to personal preference at some point. For example, the Japanese spaCy models aggressively split words into as many parts as possible, which is great if you prefer a grammar-oriented learning approach. MeCab is better at keeping whole words intact, which I definitely prefer.
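One quick way to eyeball that difference (a hypothetical snippet, not from the thread; assumes the ja_core_news_sm model and the fugashi MeCab wrapper with unidic-lite are installed):

```python
# Compare spaCy's and MeCab's segmentation of the same Japanese text.
# Setup assumed: pip install spacy fugashi unidic-lite
#                python -m spacy download ja_core_news_sm
import spacy
from fugashi import Tagger

text = "勉強しています"
nlp = spacy.load("ja_core_news_sm")
print([t.text for t in nlp(text)])          # spaCy's segmentation
print([w.surface for w in Tagger()(text)])  # MeCab's segmentation (via fugashi)
```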

xofm31 commented 8 months ago

Yeah, I have the impression that jieba is better at keeping more complex words... Realizing that Pleco has much better word coverage than the open-source dictionary (but not having access to its word list to test programmatically), I took 100 random words from each parser's output and looked them up in Pleco. 21 of the jieba words were in the dictionary, but only 5 of the spaCy words. So I do think I'd stick with jieba if it's available.

ashprice commented 8 months ago

For what it's worth, spaCy's models & languages page says it can actually use jieba, though it's not clear from the Chinese models page which models support it... I'm having an issue getting spaCy to work at all for Chinese right now; something seems to have broken the pkuseg tokenizer on my system. I made an issue for that here to see if anyone else has experienced it.

There are two other things. First, jieba relies on dictionaries, so I guess that may have an impact on accuracy, just as the model for pkuseg does. Second, it has three modes. From its (English) docs:

- Accurate Mode attempts to cut the sentence into the most accurate segmentations, which is suitable for text analysis.
- Full Mode gets all the possible words from the sentence. Fast but not accurate.
- Search Engine Mode, based on the Accurate Mode, attempts to cut long words into several short words, which can raise the recall rate. Suitable for search engines.
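For reference, a small snippet showing the three modes (assumes jieba is installed; the sample sentence is the one from jieba's own README):

```python
import jieba

text = "我来到北京清华大学"
print(list(jieba.cut(text)))                # Accurate Mode (the default)
print(list(jieba.cut(text, cut_all=True)))  # Full Mode
print(list(jieba.cut_for_search(text)))     # Search Engine Mode
```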

My guess is spaCy might default to one you may not expect even when using jieba.
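As a sketch of how that selection works: per spaCy's documentation, the Chinese tokenizer's segmenter option can be switched from the default character segmentation ("char") to "jieba" or "pkuseg":

```python
from spacy.lang.zh import Chinese

# Configure spaCy's Chinese tokenizer to segment with jieba instead of
# the default character-level segmentation (per spaCy's docs).
cfg = {"nlp": {"tokenizer": {"segmenter": "jieba"}}}
nlp = Chinese.from_config(cfg)
print([t.text for t in nlp("我来到北京清华大学")])
```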

mortii commented 8 months ago

v2.0.0 is now released. Let me know if something is broken.

EDIT: AnkiWeb is being a bit stubborn, so it might take a little while before it gets updated.

github-actions[bot] commented 8 months ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.