Out of curiosity, why are you removing mecab?
Because it is 20 MB of download when included by default, and the rest of AnkiMorphs is no more than 369 KB.
Makes sense. I've been noticing a bunch of differences between the way jieba and spaCy segment Chinese words. I have the impression that jieba is better, but I need to do a bit more investigation. Hopefully you'll do the MeCab work in a way that makes it easy to add jieba (or another parser) if it looks like it's worth it.
Yeah, I've been thinking about that too for the last couple of days. Making equivalent companion add-ons for the other morphemizers that come bundled with MorphMan should hopefully be relatively easy.
This is probably too much info, but here are the results of my small jieba/spaCy Chinese investigation. I'm sure this is extremely flawed, but I couldn't find a scientific comparison anywhere. I took all the text from the files I created my Anki cards from and ran it through both jieba and spaCy. My proxy for measuring how good the parsers are is: "how many of the words that the parser outputs are in the Chinese CC-CEDICT dictionary?"
From my text files, jieba found 8850 different words, and 6053 of them were in the dictionary. When I removed all the words that had numbers in them, I wound up with 5733 words in the dictionary. spaCy found 8295 words - a lot fewer - but 6004 of them were in the dictionary, and after taking out the words with numbers, 5751 remained. Almost the same. Another interesting point is that there was only a 77% overlap between the words from jieba and spaCy.
So in summary, while jieba and spaCy do give different results, it's not clear if one is actually better than the other for those of us who are using it to find the morphs to learn.
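For anyone who wants to reproduce this, a minimal sketch of that comparison could look like the following. This is my reconstruction, not the exact script used: the file names are placeholders, `zh_core_web_sm` is just one of the spaCy Chinese models, and CC-CEDICT is assumed to be in its usual `cedict_ts.u8` text format.

```python
import jieba
import spacy

def load_cedict(path: str) -> set[str]:
    # CC-CEDICT lines look like: "TRAD SIMP [pinyin] /definition/"
    words: set[str] = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#"):
                continue  # skip the header comments
            parts = line.split(" ", 2)
            if len(parts) >= 2:
                words.update(parts[:2])  # keep traditional and simplified forms
    return words

def unique_words(tokens) -> set[str]:
    # drop whitespace tokens and anything containing a digit,
    # mirroring the "removed all the words that had numbers" step above
    return {t for t in tokens if t.strip() and not any(c.isdigit() for c in t)}

text = open("card_sources.txt", encoding="utf-8").read()
nlp = spacy.load("zh_core_web_sm")

jieba_words = unique_words(jieba.cut(text))
spacy_words = unique_words(token.text for token in nlp(text))
cedict = load_cedict("cedict_ts.u8")

print(f"jieba: {len(jieba_words)} words, {len(jieba_words & cedict)} in CC-CEDICT")
print(f"spaCy: {len(spacy_words)} words, {len(spacy_words & cedict)} in CC-CEDICT")
# one way to measure the overlap between the two parsers (Jaccard index)
print(f"overlap: {len(jieba_words & spacy_words) / len(jieba_words | spacy_words):.0%}")
```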
@xofm31 that's really cool, thanks!
The same thing applies to the Japanese morphemizers: they all have their own flaws, and switching from one to another to eliminate them leads to a whack-a-mole scenario.
I think it mostly boils down to personal preference at some point. For example, the Japanese spaCy models aggressively split words into as many parts as possible, which is great if you prefer a grammar-oriented learning approach. MeCab is better at keeping entire words intact, which I definitely prefer.
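If you want to eyeball that difference yourself, here is a quick sketch. It assumes the `ja_core_news_sm` spaCy model and the `fugashi` MeCab wrapper are installed; the example sentence is arbitrary.

```python
import spacy
from fugashi import Tagger  # a commonly used Python wrapper around MeCab

text = "図書館で本を借りました"

# spaCy's Japanese pipeline does its own word segmentation
nlp = spacy.load("ja_core_news_sm")
print("spaCy:", [token.text for token in nlp(text)])

# MeCab segmentation via fugashi, printing the surface form of each morph
tagger = Tagger()
print("MeCab:", [word.surface for word in tagger(text)])
```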
Yeah, I have the impression that jieba is better at keeping more complex words intact... Since Pleco has much better word coverage than the open-source dictionary (but I don't have access to its word list to test it programmatically), I took 100 random words that each of the parsers identified and looked them up in Pleco. 21 of the jieba words were in the dictionary, and only 5 of the spaCy words. So I think I'd stick with jieba if it's available.
For what it's worth, spaCy's models & languages page says it can actually use jieba, though it's not clear from the Chinese models page which models support it... I'm having an issue getting spaCy to work at all for Chinese right now; something seems to have broken the pkuseg tokenizer on my system. I made an issue for that here to see if anyone else has experienced it.
There are two other things. First, jieba relies on dictionaries, so I guess that could have an impact on accuracy, just as the model does for pkuseg. Second, it has three modes. From its (English) docs:
Accurate Mode attempts to cut the sentence into the most accurate segmentations, which is suitable for text analysis. Full Mode gets all the possible words from the sentence. Fast but not accurate. Search Engine Mode, based on the Accurate Mode, attempts to cut long words into several short words, which can raise the recall rate. Suitable for search engines.
My guess is that spaCy might default to a mode you don't expect, even when using jieba.
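For reference, spaCy's documented way of selecting jieba as the Chinese segmenter, alongside jieba's three modes called directly, looks roughly like this (a sketch; outputs not shown):

```python
import jieba
from spacy.lang.zh import Chinese

# spaCy v3 exposes the Chinese word segmenter in the tokenizer config;
# "jieba" is a documented option alongside "char" and "pkuseg"
nlp = Chinese.from_config({"nlp": {"tokenizer": {"segmenter": "jieba"}}})
text = "我来到北京清华大学"
print([token.text for token in nlp(text)])

# jieba's three modes, called directly:
print(list(jieba.cut(text)))                # accurate mode (the default)
print(list(jieba.cut(text, cut_all=True)))  # full mode: every possible word
print(list(jieba.cut_for_search(text)))     # search engine mode
```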
v2.0.0 is now released. Let me know if something is broken.
EDIT: AnkiWeb is being a bit stubborn, so it might take a little while before it gets updated.
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Originally posted by @mortii in https://github.com/mortii/anki-morphs/discussions/172#discussioncomment-8661329
This means that we have to make a v2 release. It's very important to handle the backwards-compatibility break gracefully, in a way that doesn't crash Anki or cause other problems that would be very annoying to users.
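One graceful option is to detect the old settings on startup and migrate them instead of crashing. Here is a hypothetical sketch using Anki's add-on config API; the key names (`am_config_version`, `difficulty`, `score`) are illustrative placeholders, not AnkiMorphs' actual config schema:

```python
from aqt import mw
from aqt.utils import showInfo

def migrate_config_to_v2() -> None:
    # read this add-on's config through Anki's add-on manager
    config = mw.addonManager.getConfig(__name__) or {}
    if config.get("am_config_version", 1) >= 2:
        return  # already on the v2 format, nothing to do
    # rename the old key instead of crashing on a missing new one
    if "difficulty" in config:
        config["score"] = config.pop("difficulty")
    config["am_config_version"] = 2
    mw.addonManager.writeConfig(__name__, config)
    showInfo("AnkiMorphs settings were upgraded to the v2 format.")
```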
We should also rename the `am-difficulty` extra field to `am-score` while we are at it; that way we don't have to break backwards compatibility twice.

Todo list:

- remove `mecab_wrapper.py`
- update the `.addon` file
- update the `ankimorphs-japanese-mecab` `.addon` file one final time
- remove the `deps` directory
- the `skip cards with only known morphs` option has a hidden side effect of moving stale cards to the end of the queue on recalc, so maybe split it into two separate options?
- rename `difficulty` to `score` (see the sketch after this list):
  - rename the `am-difficulty` extra field to `am-score`
  - rename `difficulty` to `score` in all the code
  - replace `difficulty` -> `score` everywhere
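A hypothetical sketch of the extra-field rename using Anki's note type API; it assumes the field is literally named `am-difficulty`, and the real migration would of course need error handling and AnkiMorphs' own config checks on top:

```python
from aqt import mw

def rename_extra_fields() -> None:
    # walk every note type and rename the AnkiMorphs extra field in place;
    # Anki's rename_field also updates card templates that reference the field
    for notetype in mw.col.models.all():
        for field in notetype["flds"]:
            if field["name"] == "am-difficulty":
                mw.col.models.rename_field(notetype, field, "am-score")
                mw.col.models.update_dict(notetype)  # persist the change
```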