kaegi / MorphMan

Anki plugin that reorders language cards based on the words you know

Add Vietnamese support using pyvi #113

Open kurtisc opened 4 years ago

kurtisc commented 4 years ago

Hi!

Unlike most other languages written in the Latin alphabet, Vietnamese doesn't use spaces to separate words[1]: spaces fall between syllables, and a single word can span several of them. The current spaces morphemizer is therefore unsuitable.

[1] Fun read https://www.tandfonline.com/doi/pdf/10.1080/00437956.1963.11659787
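
To make the problem concrete, here is a toy sketch of what goes wrong when splitting on spaces. It is not how pyvi works: the two-entry lexicon and the greedy longest-match helper `segment_with_lexicon` are made up for illustration only.

```python
def split_on_spaces(text):
    """What a whitespace-based morphemizer effectively does."""
    return text.split()


def segment_with_lexicon(text, lexicon, max_syllables=3):
    """Greedy longest-match segmentation over space-separated syllables.

    Hypothetical helper for illustration; a real segmenter uses a trained
    model rather than a hand-written word list.
    """
    syllables = text.split()
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, falling back to a single syllable.
        for n in range(min(max_syllables, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# "sinh viên" (student) and "đại học" (university) are each one word.
toy_lexicon = {"sinh viên", "đại học"}
sentence = "tôi là sinh viên đại học"

by_spaces = split_on_spaces(sentence)
by_lexicon = segment_with_lexicon(sentence, toy_lexicon)
print(by_spaces)   # six syllables, treated as six "words"
print(by_lexicon)  # four words
```

With the spaces approach, "sinh", "viên", "đại" and "học" would each be tracked as a separate morpheme, which is not what a learner wants to be counted as known.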

I wasn't able to find a small library that does word segmentation for Vietnamese the way Jieba does for Chinese. Bundling pyvi in-code, as was done for Jieba, would mean also bundling several much larger dependencies (e.g. NumPy).

So, if merged like this, it's unfortunately a burden on the end user to get the Vietnamese support working. On the other hand, if they don't want it, it won't appear or impact their usage.
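
Roughly, the "won't appear if they don't have it" behaviour can be sketched as below. This is a standalone illustration under my own assumptions, not the code in this PR: `is_available` and `available_morphemizers` are hypothetical names, and the real registration in MorphMan may look different.

```python
import importlib.util


def is_available(module_name):
    """Return True if a top-level module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


def available_morphemizers(optional):
    """List morphemizer names, skipping those whose dependency is missing.

    `optional` maps a morphemizer name to the top-level module it needs.
    "spaces" stands in for a morphemizer with no extra dependencies.
    """
    names = ["spaces"]
    for name, module_name in optional.items():
        if is_available(module_name):
            names.append(name)
    return names


# "vietnamese" is only listed when pyvi is importable.
print(available_morphemizers({"vietnamese": "pyvi"}))
```

Checking with `find_spec` avoids actually importing the heavy dependency just to decide whether the option should be offered.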

If this gets included, I'll look into packaging pyvi and its dependencies as a separate addon, as has been done for Mecab, licences permitting. That would make the installation more straightforward and avoid forcing use of the source version of Anki.

kurtisc commented 4 years ago

Rebased on master and confirmed working when #125 is merged.

With regard to #145: I do have a test for this morphemizer, so hopefully that addresses @shanrauf's comment.

ianki commented 4 years ago

Would you mind rebasing again, so I can see if the tests pass? I'll submit after.

smartlitchi commented 4 years ago

I am really interested in this

sedosido commented 3 years ago

I haven’t been able to build Anki from source to import pyvi (I think because my hardware is a little old). Is there any other way I can get Vietnamese parsing to work with MorphMan?