Open koheiw opened 6 years ago
This could be achieved by shipping dictionaries with the package, as many NLP packages do through `download.model` or a similar function. Since MeCab dictionaries do not need to be installed (running `configure && make` is enough to get them ready to load), I think it's possible to develop such a function in the package.
For that, I would like to ask your opinion. Is the IPA dictionary sufficient for analyzing Japanese? Or is it better to let users select the dictionary they want to download?
By the way, if the package downloads the dictionary, the path to the dictionary will become very complex. When compiling a user dictionary, the location of the dictionary has to be passed to `mecab-dict-index` with the `-d` argument. If we want to keep that path short, it's better to install the MeCab dictionary in a designated location.
Thank you so much for advertising the package. I hope this package helps with your work!
I think IPA is enough in most cases, but there are other popular dictionaries such as https://github.com/neologd/mecab-ipadic-neologd.
How about recording the locations of dictionaries in the package options for "cn", "ja" and "ko"? They should default to the dictionaries that are in the package, but users could point them to different dictionaries (which they download manually) without typing the path every time they use `pos()`.
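A minimal sketch of this suggestion in R, using only base `options()` and `getOption()`. The option name `mecab.dict` and the helper `dict_path()` are hypothetical and not part of the package, and the path is a placeholder.

```r
# Hypothetical option name and helper; not the package's actual API.
# Store one dictionary directory per language in a single named list.
options(mecab.dict = list(
  cn = "",   # an empty string stands for "use the bundled default"
  ja = "",
  ko = ""
))

# Look up the dictionary directory for a language code.
dict_path <- function(lang = c("ja", "cn", "ko")) {
  lang <- match.arg(lang)
  dicts <- getOption("mecab.dict", default = list())
  path <- dicts[[lang]]
  if (is.null(path)) "" else path
}

# A user who downloaded another dictionary sets its path once ...
dicts <- getOption("mecab.dict")
dicts$ja <- "/path/to/mecab-ipadic-neologd"   # placeholder path
options(mecab.dict = dicts)

# ... and later calls can resolve it without retyping the path.
dict_path("ja")
```

The returned path could then be handed to `pos()` through `sys_dic`, which this thread mentions as the current way to switch dictionaries.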
Sorry for the late reply. I'm developing a `lang` argument using the `options()` function.
There are limitations on loading the default values, so I'm considering using the jsonlite package to save the dictionary directories for each language.
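As a rough sketch of how that might look, the per-language directories could be written to a JSON file and read back when needed. This assumes the jsonlite package is installed; the file location and the helper names `save_dicts()` and `load_dicts()` are made up for illustration, and the paths are placeholders.

```r
library(jsonlite)

# Hypothetical location for the settings file; a real package would
# likely choose a persistent per-user directory instead of tempdir().
config_file <- file.path(tempdir(), "mecab_dicts.json")

# Write a named list of dictionary directories to disk as JSON.
save_dicts <- function(dicts, file = config_file) {
  write_json(dicts, file, auto_unbox = TRUE, pretty = TRUE)
  invisible(file)
}

# Read the directories back; return an empty list if nothing is saved.
load_dicts <- function(file = config_file) {
  if (!file.exists(file)) return(list())
  read_json(file)
}

# Placeholder paths for illustration only.
save_dicts(list(ja = "/path/to/ja/dic", ko = "/path/to/ko/dic"))
load_dicts()$ko
```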
By the way, could I get your advice? Is the de-inflected form of a word important in Japanese? Some Korean users asked me to expand the result to include the de-inflected form, so that they can use it for lemmatization. MeCab dictionaries provide it among their features, so I wonder if it would be valuable for Japanese too.
Great that you are working on the `lang` option.
I think that many people use MeCab for lemmatization of Japanese words, but it is enough if `pos()` returns a data.frame with all the MeCab outputs.
I often analyze different languages in the same project. Is there a way to specify which language model to use (Japanese or Korean)? I could do this by changing `sys_dic`, but it would be easier if there were a `lang` argument based on which `pos()` switches models internally.
By the way, I started advertising your package: https://koheiw.net/wp-content/uploads/2018/07/Asian-text-analysis.pdf