kaegi / MorphMan

Anki plugin that reorders language cards based on the words you know

For Chinese, we should be able to load a user dictionary using Jieba #275

Open aash949 opened 2 years ago

aash949 commented 2 years ago

Jieba has a function to load a user dictionary, which makes word segmentation more accurate relative to your dictionary of choice, e.g. the CC-CEDICT dictionary. Here's the function:

```python
jieba.load_userdict(file_name)
```

I am proposing that, when Jieba is initialized, we check whether there is a userdict.txt file in dbs (like frequency.txt) and, if there is, use this function to load its contents before doing any word segmentation.

I haven't written much code since university, but I'll see whether I can implement this change myself.
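A minimal sketch of the check, assuming the dbs directory path is available as `dbs_path` (a hypothetical stand-in; `jieba.load_userdict` itself is the real Jieba API):

```python
import os
import jieba

def load_optional_userdict(dbs_path):
    # dbs_path is assumed to point at MorphMan's dbs directory, the
    # same place frequency.txt lives; userdict.txt is the proposed name.
    userdict_path = os.path.join(dbs_path, "userdict.txt")
    if os.path.isfile(userdict_path):
        # Jieba's userdict format is one entry per line:
        # "word [freq] [POS tag]", with freq and tag optional.
        jieba.load_userdict(userdict_path)
```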

ghost commented 1 year ago

Is there any news on this?

aash949 commented 1 year ago

> Is there any news on this?

Implementing this and achieving the desired result (or at least my desired result) could be more complicated than I first thought.

If you load a user dictionary with Jieba before performing word segmentation, it will improve the segmentation relative to your dictionary, which is nice.

However, Jieba will still segment text the way it thinks it should be segmented rather than strictly according to your dictionary.

What I often find is that Jieba treats two words that have separate entries in your dictionary as one longer word (i.e. a portmanteau), but that longer word isn't in your dictionary, which can be a bit annoying if you would rather learn the two words and their individual meanings separately.

I think the best way to resolve this is to load your dictionary, perform word segmentation, and then check word by word whether each word is in your dictionary. If a word is not in your dictionary (it is probably two dictionary words that Jieba has fused), use Jieba's del_word(word) function to delete it, then run word segmentation again to see whether the two words are now segmented separately, each with a dictionary entry available.
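A rough sketch of that retry idea, assuming the dictionary headwords are available as a set called `dict_words` (hypothetical; `jieba.del_word` is the real Jieba API):

```python
import jieba

def segment_against_dict(text, dict_words):
    # dict_words: set of headwords from your dictionary of choice
    # (e.g. CC-CEDICT). Purely illustrative, not MorphMan code.
    tokens = list(jieba.cut(text))
    changed = False
    for tok in tokens:
        # Multi-character tokens missing from the dictionary are the
        # suspected fused words; single characters are left alone.
        if len(tok) > 1 and tok not in dict_words:
            jieba.del_word(tok)  # stop Jieba from forming this word
            changed = True
    if changed:
        # Re-segment without the deleted words.
        tokens = list(jieba.cut(text))
    return tokens
```

Note that del_word mutates Jieba's global dictionary, so deletions made for one note carry over to every later cut, and because the default HMM can still emit unseen words, a single retry may not settle everything.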

I think this would slow things down a lot though.

Perhaps I'm overthinking this.