junhewk / RcppMeCab

RcppMeCab: Rcpp Interface of CJK Morpheme Analyzer MeCab

Is there an easy way to switch between language models? #3

Open koheiw opened 6 years ago

koheiw commented 6 years ago

I often analyze different languages in the same project. Is there a way to specify which language model to use (Japanese or Korean)? I could do this by changing sys_dic, but it would be easier if there were a lang argument that pos() uses to switch models internally.

By the way, I have started advertising your package: https://koheiw.net/wp-content/uploads/2018/07/Asian-text-analysis.pdf

junhewk commented 6 years ago

This could be achieved by including dictionaries in the package, as many NLP packages do through download.model or a similar function. Because MeCab dictionaries do not need to be installed (running configure && make is enough to make them loadable), I think it's possible to develop such a function in the package.

For that, I'd like to ask your opinion. Is the IPA dictionary sufficient for analyzing Japanese? Or would it be better to let users choose which dictionary to download?

By the way, if the package downloads the dictionary, the path to the dictionary will become very complex. When compiling a user dictionary, the location of the system dictionary has to be passed to mecab-dict-index with the -d argument. If we want to keep that path short, it would be better to install the MeCab dictionary in a designated location.

Thank you so much for advertising the package. I hope this package helps with your work!

koheiw commented 6 years ago

I think IPA is enough in most cases, but there are other popular dictionaries, such as https://github.com/neologd/mecab-ipadic-neologd.

How about recording the locations of the dictionaries in package options for "cn", "ja" and "ko"? The options should default to the dictionaries bundled with the package, but this would let users use other dictionaries (which they download manually) without typing the path every time they call pos().
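The option-based lookup suggested above could look roughly like the following minimal sketch. The option name `mecab_dic_sketch` and the helpers `set_sys_dic()` and `resolve_sys_dic()` are hypothetical names made up for illustration and are not part of RcppMeCab; the dictionary path is a placeholder. An empty string stands for "no dictionary registered".

```r
# Hypothetical option holding one dictionary directory per language code.
# An empty string means "nothing registered for this language".
options(mecab_dic_sketch = list(ja = "", ko = "", cn = ""))

# Hypothetical helper: register a dictionary directory for one language.
set_sys_dic <- function(lang, path) {
  dics <- getOption("mecab_dic_sketch", default = list())
  dics[[lang]] <- path
  options(mecab_dic_sketch = dics)
  invisible(dics)
}

# Hypothetical helper: look up the directory registered for a language.
resolve_sys_dic <- function(lang = c("ja", "ko", "cn")) {
  lang <- match.arg(lang)
  dic <- getOption("mecab_dic_sketch", default = list())[[lang]]
  if (is.null(dic)) "" else dic
}

# Placeholder path; replace with the directory of a real dictionary.
set_sys_dic("ja", "/path/to/ja-dic")
resolve_sys_dic("ja")
```

A wrapper could then pass the result to the existing sys_dic argument, for example `pos(text, sys_dic = resolve_sys_dic("ja"))`, so that the path is typed only once per session.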

junhewk commented 5 years ago

Sorry for the late reply. I'm developing the lang argument using the options() function. There are limitations on loading the default values, so I'm considering using the jsonlite package to save the dictionary directories for each language.

By the way, could I get your advice? Is the de-inflected form of a word important in Japanese? Some Korean users have asked me to expand the result to include the de-inflected form so that it can be used for lemmatization. MeCab dictionaries provide it among their features, so I wonder whether it would be valuable for Japanese as well.
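One way the jsonlite idea might work is sketched below, assuming the goal is to keep the per-language directories in a small JSON file so they can be reloaded in a later session. The helper names `save_dic_dirs()` and `load_dic_dirs()`, the file name, and the placeholder paths are all hypothetical and are not taken from the package; the file is written to a temporary directory purely for illustration.

```r
library(jsonlite)

# Hypothetical helper: write a named list of dictionary directories to JSON.
save_dic_dirs <- function(dirs, file) {
  # auto_unbox writes length-one vectors as JSON strings instead of arrays
  write_json(dirs, file, auto_unbox = TRUE, pretty = TRUE)
  invisible(file)
}

# Hypothetical helper: read the directories back as a named list.
load_dic_dirs <- function(file) {
  # Return an empty list when nothing has been saved yet
  if (!file.exists(file)) return(list())
  read_json(file)
}

# Illustration only: a real package would need a persistent location.
config_file <- file.path(tempdir(), "mecab-dics.json")
save_dic_dirs(list(ja = "/path/to/ja-dic", ko = "/path/to/ko-dic"), config_file)
dirs <- load_dic_dirs(config_file)
dirs$ja
```

The loaded list could then be used to fill the defaults that options() holds for the current session.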

koheiw commented 5 years ago

Great that you are working on the lang option.

I think that many people use MeCab for lemmatization of Japanese words, but it is enough if pos() returns a data.frame with all of the MeCab output.