mortii / anki-morphs-mecab


ipadic lemma #1

Closed: ConnorGCarr closed this issue 2 months ago

ConnorGCarr commented 2 months ago

There are cases where the dictionary bundled with anki-morphs-mecab (ipadic) views some lemmas as distinct, where I think most users would view them as the same. The two cases I've noticed are:

  1. all-kana writings of a word that could also be written with kanji, e.g.

    • しゃべる and 喋る
    • あたりまえ and 当たり前
    • はじめて and 初めて
  2. the potential forms of verbs, mostly 五段, but ら抜き versions of 一段 verbs as well, e.g.

    • 食う and 食える
    • 走る and 走れる
    • 見る and 見れる

There are overlapping cases as well: しゃべる, 喋る, しゃべれる, and 喋れる are each counted as a distinct lemma. The result is anki-morphs queuing cards that don't have any true unknown lemmas, or withholding cards that anki-morphs thinks have multiple unknown lemmas.

UniDic catches these as all being the same lemma. Is there any reason not to switch?
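
To make the effect concrete, here is a minimal sketch in Python. The two lemma tables are hand-written to mirror the behavior described above; they are not real ipadic or UniDic output. `unknown_lemmas` is a hypothetical helper for illustration, not anki-morphs code.

```python
# Hand-written lemma tables that mimic the behavior described in this issue.
# They are illustrative assumptions, NOT real ipadic or UniDic output.
IPADIC_LIKE = {
    "しゃべる": "しゃべる",
    "喋る": "喋る",
    "しゃべれる": "しゃべれる",
    "喋れる": "喋れる",
}
# Every variant collapses to a single lemma.
UNIDIC_LIKE = {form: "喋る" for form in IPADIC_LIKE}


def unknown_lemmas(card_forms, known_forms, lemma_of):
    """Return the lemmas on a card that are not among the learner's known lemmas."""
    known = {lemma_of[form] for form in known_forms}
    return {lemma_of[form] for form in card_forms} - known


known = ["喋る"]               # the learner already knows this word
card = ["しゃべる", "喋れる"]   # a card using the kana and potential variants

ipadic_unknown = unknown_lemmas(card, known, IPADIC_LIKE)
unidic_unknown = unknown_lemmas(card, known, UNIDIC_LIKE)
print(len(ipadic_unknown), len(unidic_unknown))
```

With the first table the card appears to contain two unknown lemmas; with the second it contains none, which is the difference that would change how the card is scheduled.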

mortii commented 2 months ago

It's been a while since I worked on this, but IIRC the unidic version is almost identical to the japanese spacy morphemizer, especially in how aggressively it splits morphs, which a lot of people don't like.

> UniDic catches these as all being the same lemma.

Have you tested this with spacy? It'd be interesting to see if it also gets the correct lemmas.
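
One way to run that check might look like the sketch below. `lemma_groups` and `spacy_lemmatizer` are hypothetical helper names, and `ja_core_news_sm` is assumed to be the installed spaCy Japanese model; a different model may behave differently.

```python
def lemma_groups(words, lemmatize):
    """Group words by the lemma that `lemmatize` assigns to each of them."""
    groups = {}
    for word in words:
        groups.setdefault(lemmatize(word), []).append(word)
    return groups


def spacy_lemmatizer(model_name="ja_core_news_sm"):
    """Build a word -> lemma function backed by spaCy.

    spaCy is imported lazily so the grouping helper can be used without it.
    Assumes spaCy and the named Japanese model are installed.
    """
    import spacy

    nlp = spacy.load(model_name)
    # Each word is analyzed on its own; if it is split into several tokens,
    # their lemmas are joined with "+" so the split stays visible.
    return lambda word: "+".join(token.lemma_ for token in nlp(word))


# Usage (requires spaCy and the model):
#   words = ["しゃべる", "喋る", "しゃべれる", "喋れる"]
#   for lemma, forms in lemma_groups(words, spacy_lemmatizer()).items():
#       print(lemma, forms)
```

If all four forms land in one group, spaCy agrees with the UniDic result reported above; several groups would mean it splits them the way ipadic does.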