This package, RcppMeCab, is an Rcpp wrapper for the part-of-speech morphological analyzer MeCab. It supports native UTF-8 encoding in C++ code and the CJK (Chinese, Japanese, and Korean) MeCab libraries. This package uses the speed Rcpp brings to R computation to analyze texts faster.
__Please see this for easy installation and usage examples in Korean.__
pos() will return a character vector, not a list.
pos() and posParallel() return lists, not named lists. We decided to remove the original texts from the results, since keeping them does not fit the R way.

First, install MeCab for your language of choice.
MeCab from GitHub
MeCab-Ko from the Bitbucket repository
MeCab and MeCab Chinese Dic from MeCab-Chinese

Second, you can install RcppMeCab from CRAN with:
install.packages("RcppMeCab") # build from source
# install.packages("devtools")
devtools::install_github("junhewk/RcppMeCab") # install the development version
Set the language you want to analyze with the environment variable MECAB_LANG before installing the package. The default value is ko; to analyze Japanese or Chinese, set it to ja.
install.packages("RcppMeCab") # install the Korean version (default)
# or, install for Japanese
Sys.setenv(MECAB_LANG = 'ja') # select Japanese before installing
install.packages("RcppMeCab", type="source") # build from source
# install.packages("devtools")
devtools::install_github("junhewk/RcppMeCab") # install the development version
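Sys.setenv() only affects the current R session and the processes it starts, so the variable has to be set in the same session that runs the installation. A minimal sketch of checking the value first; set_mecab_lang is a hypothetical helper, not part of RcppMeCab:

```r
# Hypothetical helper: set MECAB_LANG for this session and return the value
# that the installation step will see.
set_mecab_lang <- function(lang = c("ko", "ja")) {
  lang <- match.arg(lang)          # reject anything other than "ko" or "ja"
  Sys.setenv(MECAB_LANG = lang)
  invisible(Sys.getenv("MECAB_LANG"))
}

set_mecab_lang("ja")
print(Sys.getenv("MECAB_LANG"))

# Then install in the same session, for example:
# install.packages("RcppMeCab", type = "source")
```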
To run the analysis, you also need the MeCab binary and dictionary.
For Korean:
Install mecab-ko-msvc and mecab-ko-dic-msvc matching your 32-bit or 64-bit Windows version in C:\mecab. Provide the directory location to the RcppMeCab function.
The current mecab-ko-msvc does not work in R. Please use mecab-ko-msvc 0.9.2 or lower.
For Japanese:
Install the mecab binary. Provide the directory location to the RcppMeCab function. For example: pos(sentence, sys_dic = "C:/PROGRA~2/mecab/dic/ipadic")
This package provides the pos and posParallel functions.
pos(sentence) # returns a list; the sentence appears in the names of the list
pos(sentence, join = FALSE) # yields morphemes only (tags are given in the vector names)
pos(sentence, format = "data.frame") # returns the result as a data frame
pos(sentence, user_dic = user_dic) # uses a compiled user dictionary
posParallel(sentence, user_dic = user_dic) # parallelized version; uses more memory, but is much faster than a single-threaded loop

format: if it is set to "data.frame", the function will return the result in a data frame format.
sys_dic: the directory where dicrc, model.bin, and other files are located. The default value is "", or you can set your own default using options(mecabSysDic = "").
user_dic: a user dictionary compiled with mecab-dict-index. The default value is also "".

Do not use abbreviated paths, e.g. the tilde expression (~/). Please provide the full path name in sys_dic and user_dic.
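Because the tilde form is not accepted, it can help to expand paths in R before passing them on. The following is a minimal sketch using base R only; full_path is a hypothetical helper, the dictionary locations and the sample sentence are assumptions, and the tagging step is skipped unless the package and both dictionaries are present.

```r
# Expand "~" into an absolute path; mustWork = FALSE tolerates paths that
# do not exist yet.
full_path <- function(path) {
  normalizePath(path.expand(path), winslash = "/", mustWork = FALSE)
}

# Hypothetical locations; replace them with your own.
sys_dic  <- full_path("~/mecab/dic/mecab-ko-dic")
user_dic <- full_path("~/mecab/userdic.dic")

# Hypothetical sample input, converted to UTF-8.
sentence <- enc2utf8("안녕하세요")

# Run only when the package, the system dictionary, and the user dictionary exist.
ready <- requireNamespace("RcppMeCab", quietly = TRUE) &&
  file.exists(file.path(sys_dic, "dicrc")) &&
  file.exists(user_dic)

if (ready) {
  print(RcppMeCab::pos(sentence, sys_dic = sys_dic, user_dic = user_dic))
  print(RcppMeCab::posParallel(sentence, sys_dic = sys_dic, user_dic = user_dic))
}
```

Naming the sys_dic and user_dic arguments explicitly avoids passing a path into another positional parameter by accident.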
The MeCab API has DictionaryCompiler, but it contains die(). Hence, calling it from Rcpp crashes the entire R session, so it will not be included in the RcppMeCab functions.
Please refer to MeCab for Japanese.
You need a model_file if you want the library to estimate costs automatically.
model.bin in mecab-ko-dic
You need the entire mecab-ko-dic source if you want to compile a Korean user dictionary. The user dictionary must also be prepared as a CSV file. The CSV structure is described in the documentation for Japanese and Korean.
Compile:
$ /usr/local/libexec/mecab/mecab-dict-index -m `model_file` -d `mecab_dic_location` -u `user_dictionary_file_name` -f `CSV file charset` -t `original dictionary charset` `target_csv`
# example
$ /usr/local/libexec/mecab/mecab-dict-index -m /usr/local/lib/mecab/dic/mecab-ko-dic/model.bin -d ~/mecab-ko-dic-2.0.3-20170922 -u userdic.dic -f utf8 -t utf8 ~/person.csv
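The same compilation can be started from an R session with system2(). This is only a sketch: the paths are copied from the example above and are assumptions about your installation, and the command runs only when the binary and the CSV file are found.

```r
# Paths taken from the example above; adjust them to your installation.
dict_index <- "/usr/local/libexec/mecab/mecab-dict-index"
model_file <- "/usr/local/lib/mecab/dic/mecab-ko-dic/model.bin"
dic_source <- path.expand("~/mecab-ko-dic-2.0.3-20170922")
csv_file   <- path.expand("~/person.csv")

# Argument vector in the same order as the shell command; shQuote() protects
# paths that contain spaces.
args <- c(
  "-m", shQuote(model_file),
  "-d", shQuote(dic_source),
  "-u", "userdic.dic",
  "-f", "utf8",
  "-t", "utf8",
  shQuote(csv_file)
)

if (file.exists(dict_index) && file.exists(csv_file)) {
  status <- system2(dict_index, args)
  if (status != 0) warning("mecab-dict-index exited with status ", status)
}
```

The resulting userdic.dic is written to the current working directory and can then be passed to pos() through user_dic with its full path.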
mecab-ko-msvc has mecab-dict-index.exe.
The MeCab binary version has mecab-dict-index.exe.
You can use it in the same way the Linux binary compiles the dictionary.
Junhewk Kim (junhewk.kim@gmail.com)
Kato Akiru