ConnorGCarr closed this 2 months ago
It's been a while since I worked on this, but IIRC the UniDic version behaves almost identically to the Japanese spaCy morphemizer, especially in how aggressively it splits morphs, which a lot of people don't like.
> UniDic catches these as all being the same lemma.
Have you tested this with spaCy? It'd be interesting to see if it also gets the correct lemmas.
There are cases where the dictionary bundled with anki-morphs-mecab (ipadic) views some lemmas as distinct, where I think most users would view them as the same. The two cases I've noticed are:
all-kana writings of a word that could also be written with kanji, e.g.
the potential forms of verbs, mostly 五段, but ら抜き versions of 一段 verbs as well, e.g.
There are overlapping cases as well: しゃべる, 喋る, しゃべれる, and 喋れる are each counted as a distinct lemma. The result is anki-morphs queuing cards that don't have any true unknown lemmas, or withholding cards that anki-morphs thinks have multiple unknown lemmas.
UniDic catches these as all being the same lemma. Is there any reason not to switch?
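To make the effect concrete, here is a minimal sketch of how lemma splitting changes the unknown count for a card. The two lookup tables are hand-written for illustration and are not real output from ipadic or UniDic; `unknown_lemmas` is a hypothetical helper, not part of anki-morphs. The merged table assumes all four forms resolve to 喋る.

```python
# Hypothetical lemma tables, hand-written for illustration only.
# SPLIT_LEMMAS mimics the behaviour described above, where every
# surface form is treated as its own lemma.
SPLIT_LEMMAS = {
    "しゃべる": "しゃべる",
    "喋る": "喋る",
    "しゃべれる": "しゃべれる",
    "喋れる": "喋れる",
}

# MERGED_LEMMAS assumes all four forms share one lemma (喋る).
MERGED_LEMMAS = {form: "喋る" for form in SPLIT_LEMMAS}


def unknown_lemmas(card_forms, known_forms, lemma_of):
    """Return the lemmas on a card that are not among the known lemmas."""
    known = {lemma_of[form] for form in known_forms}
    return {lemma_of[form] for form in card_forms} - known


known_forms = ["喋る"]   # the learner already knows this word
card = ["しゃべれる"]    # a card containing the kana potential form

print(unknown_lemmas(card, known_forms, SPLIT_LEMMAS))   # one "unknown" lemma
print(unknown_lemmas(card, known_forms, MERGED_LEMMAS))  # no unknown lemmas
```

With the split table the card looks like it has one unknown lemma; with the merged table it has none, which is the difference that drives the queuing problem.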