junhewk / RcppMeCab

RcppMeCab: Rcpp Interface of CJK Morpheme Analyzer MeCab

Needs test on Japanese. #1

Open haven-jeon opened 6 years ago

haven-jeon commented 6 years ago

It's now on CRAN: https://CRAN.R-project.org/package=RcppMeCab

We need to check RcppMeCab's results on Japanese text.

@koheiw, could you help with this? Any ideas?

koheiw commented 6 years ago

Hi @haven-jeon and @junhewk! I am impressed by how quickly you published your package on CRAN. I ran a test on Windows and Linux. There seems to be an issue on Windows (or I am doing something wrong). It works smoothly on Linux, but the output of pos() seems too large (posParallel() looks fine).

Windows

> require(quanteda)
> require(RcppMeCab)
> #devtools::install_github("quanteda/quanteda.corpora")
> require(quanteda.corpora)
> 
> corp <- download("data_corpus_foreignaffairscommittee")
> txt <- tail(texts(corp), 1000)
> 
> pos(txt[1], join = FALSE)
Exception: 
Error in posRcpp(sentence, sys_dic, user_dic) : 
  Not compatible with STRSXP: [type=NULL].
> pos(txt[1], join = TRUE, sys_dic = "C:\Program Files (x86)\MeCab\dic\ipadic")
Error: '\P' is an unrecognized escape in character string starting ""C:\P"
> pos(txt[1], join = TRUE, sys_dic = "C://Program Files (x86)//MeCab//dic//ipadic")
Exception: 
Error in posJoinRcpp(sentence, sys_dic, user_dic) : 
  Not compatible with STRSXP: [type=NULL].
> pos(txt[1], join = TRUE, sys_dic = "C:/Program Files (x86)/MeCab/dic/ipadic")
Exception: 
Error in posJoinRcpp(sentence, sys_dic, user_dic) : 
  Not compatible with STRSXP: [type=NULL].
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] quanteda.corpora_0.85 RcppMeCab_0.0.1.1     quanteda_1.3.0       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17       knitr_1.20         magrittr_1.5       stopwords_0.9.0    munsell_0.5.0      colorspace_1.3-2  
 [7] lattice_0.20-35    rlang_0.2.1        fastmatch_1.1-0    stringr_1.3.1      plyr_1.8.4         tools_3.5.0       
[13] grid_3.5.0         data.table_1.11.4  gtable_0.2.0       xfun_0.2           spacyr_0.9.9       htmltools_0.3.6   
[19] RcppParallel_4.4.0 yaml_2.1.19        lazyeval_0.2.1     rprojroot_1.3-2    digest_0.6.15      tibble_1.4.2      
[25] bookdown_0.7       Matrix_1.2-14      ggplot2_2.2.1      evaluate_0.10.1    rmarkdown_1.10     blogdown_0.6      
[31] stringi_1.1.7      pillar_1.2.3       compiler_3.5.0     scales_0.5.0       backports_1.1.2    lubridate_1.7.4   
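
The '\P' error in the Windows transcript comes from R's string parser rejecting an unknown escape sequence, so that particular call never reaches MeCab; the later calls with forward slashes parse fine and fail for a different reason. A minimal base-R sketch of the two valid spellings of the same path (the dictionary location is simply the one used in the transcript and may differ on other machines):

```r
# Backslashes must be doubled inside an R string literal ...
p_escaped <- "C:\\Program Files (x86)\\MeCab\\dic\\ipadic"
# ... or replaced with forward slashes, which R on Windows also accepts.
p_forward <- "C:/Program Files (x86)/MeCab/dic/ipadic"

# Both literals describe the same location once separators are normalised.
same_path <- identical(gsub("\\", "/", p_escaped, fixed = TRUE), p_forward)
print(same_path)
```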

Linux

> require(quanteda)
> require(RcppMeCab)
> #devtools::install_github("quanteda/quanteda.corpora")
> require(quanteda.corpora)
> 
> corp <- download("data_corpus_foreignaffairscommittee")
> txt <- tail(texts(corp), 1000)
> 
> out <- posParallel(txt)
> object.size(out)
7046928 bytes
> 
> out2 <- pos(txt)
> object.size(out2)
1005692696 bytes
> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: KDE neon User Edition 5.13

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8       
 [4] LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] quanteda.corpora_0.85 RcppMeCab_0.0.1.1     quanteda_1.3.0       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17       RMeCab_0.99999     knitr_1.20         magrittr_1.5       stopwords_0.9.0   
 [6] munsell_0.5.0      colorspace_1.3-2   lattice_0.20-35    rlang_0.2.1        fastmatch_1.1-0   
[11] stringr_1.3.1      plyr_1.8.4         tools_3.4.4        grid_3.4.4         data.table_1.11.4 
[16] gtable_0.2.0       xfun_0.2           spacyr_0.9.91      htmltools_0.3.6    RcppParallel_4.4.0
[21] yaml_2.1.19        lazyeval_0.2.1     rprojroot_1.3-2    digest_0.6.15      tibble_1.4.2      
[26] bookdown_0.7       Matrix_1.2-14      ggplot2_2.2.1      evaluate_0.10.1    rmarkdown_1.10    
[31] blogdown_0.6       stringi_1.2.3      pillar_1.2.3       compiler_3.4.4     scales_0.5.0      
[36] backports_1.1.2    lubridate_1.7.4

The package looks good already, but my suggestion/request is to

koheiw commented 6 years ago

I will present at TokyoR on the 15th of next month and will highlight your package in my talk. There will be people from RStudio too.

junhewk commented 6 years ago

Thank you so much, @haven-jeon and @koheiw! I tested and revised the package.

  1. I fixed the arguments of the routines so that sys_dic works properly.
  2. There was a glitch in the pos() function. I fixed it.
  3. The input character vectors are now stored in the names attribute of the result list.
  4. I added options(mecabSysDic=) to preserve the user's preferred MeCab system dictionary.
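
Item 4 could be used roughly as follows. This is a sketch based only on the option name given in the list; the dictionary path is an assumed example and has to point at an existing MeCab dictionary on the machine in question.

```r
# Remember the preferred system dictionary for the current R session.
# The path is an assumption for illustration; adjust it to your install.
old <- options(mecabSysDic = "C:/Program Files (x86)/MeCab/dic/ipadic")

# Inspect what is currently stored.
dic <- getOption("mecabSysDic")
print(dic)

# Restore the previous setting when done.
options(old)
```

Putting the options() call into .Rprofile would keep the preference across sessions.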
> library(quanteda)
> library(quanteda.corpora)
> library(RcppMeCab)
> 
> corp <- download("data_corpus_foreignaffairscommittee")
> txt <- tail(texts(corp), 1000)
>
> out <- posParallel(txt)
> out2 <- pos(txt)
>
> object.size(out)
8831024 bytes
> object.size(out2)
8831024 bytes
>
> out[1000]
$`○三ッ矢委員長 以上で説明は終わりました。\n 次回は、公報をもってお知らせすることとし、本日は、これにて散会いたします。\n    午後一時十一分散会\n`
 [1] "○/記号"        "三ッ矢/名詞"   "委員/名詞"     "長/名詞"       " /記号"       "以上/名詞"    
 [7] "で/助詞"       "説明/名詞"     "は/助詞"       "終わり/動詞"   "まし/助動詞"   "た/助動詞"    
[13] "。/記号"       " /記号"       "次回/名詞"     "は/助詞"       "、/記号"       "公報/名詞"    
[19] "を/助詞"       "もっ/動詞"     "て/助詞"       "お知らせ/名詞" "する/動詞"     "こと/名詞"    
[25] "と/助詞"       "し/動詞"       "、/記号"       "本日/名詞"     "は/助詞"       "、/記号"      
[31] "これ/名詞"     "にて/助詞"     "散会/名詞"     "いたし/動詞"   "ます/助動詞"   "。/記号"      
[37] " /記号"       " /記号"       " /記号"       " /記号"       "午後/名詞"     "一/名詞"      
[43] "時/名詞"       "十/名詞"       "一/名詞"       "分/名詞"       "散会/名詞"    

> out2[1000]
$`○三ッ矢委員長 以上で説明は終わりました。\n 次回は、公報をもってお知らせすることとし、本日は、これにて散会いたします。\n    午後一時十一分散会\n`
 [1] "○/記号"        "三ッ矢/名詞"   "委員/名詞"     "長/名詞"       " /記号"       "以上/名詞"    
 [7] "で/助詞"       "説明/名詞"     "は/助詞"       "終わり/動詞"   "まし/助動詞"   "た/助動詞"    
[13] "。/記号"       " /記号"       "次回/名詞"     "は/助詞"       "、/記号"       "公報/名詞"    
[19] "を/助詞"       "もっ/動詞"     "て/助詞"       "お知らせ/名詞" "する/動詞"     "こと/名詞"    
[25] "と/助詞"       "し/動詞"       "、/記号"       "本日/名詞"     "は/助詞"       "、/記号"      
[31] "これ/名詞"     "にて/助詞"     "散会/名詞"     "いたし/動詞"   "ます/助動詞"   "。/記号"      
[37] " /記号"       " /記号"       " /記号"       " /記号"       "午後/名詞"     "一/名詞"      
[43] "時/名詞"       "十/名詞"       "一/名詞"       "分/名詞"       "散会/名詞"    

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RcppMeCab_0.0.1.2     quanteda.corpora_0.85 quanteda_1.3.0       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17       magrittr_1.5       devtools_1.13.5    stopwords_0.9.0    munsell_0.5.0     
 [6] colorspace_1.3-2   lattice_0.20-35    R6_2.2.2           rlang_0.2.1        fastmatch_1.1-0   
[11] stringr_1.3.1      httr_1.3.1         plyr_1.8.4         tools_3.5.0        grid_3.5.0        
[16] data.table_1.11.4  gtable_0.2.0       spacyr_0.9.9       git2r_0.21.0       withr_2.1.2       
[21] lazyeval_0.2.1     RcppParallel_4.4.0 digest_0.6.15      tibble_1.4.2       Matrix_1.2-14     
[26] ggplot2_2.2.1      curl_3.2           memoise_1.1.0      stringi_1.2.2      compiler_3.5.0    
[31] pillar_1.2.3       scales_0.5.0       lubridate_1.7.4   

Could you try this revised version on GitHub, @koheiw? To get the Japanese DLLs, call Sys.setenv() before installing the package.

Sys.setenv(MECAB_LANG='jp')
devtools::install_github("junhewk/RcppMeCab")

koheiw commented 6 years ago

Sounds promising, but there is a system dependency. On my Windows machine, installation from GitHub fails. How does the package find the location of MeCab?

> Sys.setenv(MECAB_LANG='jp')
> devtools::install_github("junhewk/RcppMeCab")

-lR
installing to C:/Users/Kohei/Documents/R/win-library/3.5/RcppMeCab/libs/x64
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
  converting help for package 'RcppMeCab'
    finding HTML links ... done
    RcppMeCab                               html  
    pos                                     html  
    posParallel                             html  
** building package indices
** testing if installed package can be loaded
*** arch - i386
*** arch - x64
Error: package or namespace load failed for 'RcppMeCab' in inDL(x, as.logical(local), as.logical(now), ...):
 unable to load shared object 'C:/Users/Kohei/Documents/R/win-library/3.5/RcppMeCab/libs/x64/RcppMeCab.dll':
  LoadLibrary failure:  The specified module could not be found.

Error: loading failed
Execution halted
ERROR: loading failed for 'x64'
* removing 'C:/Users/Kohei/Documents/R/win-library/3.5/RcppMeCab'
* restoring previous 'C:/Users/Kohei/Documents/R/win-library/3.5/RcppMeCab'
In R CMD INSTALL

By the way, Sys.setenv(MECAB_LANG='ja') would be more appropriate, because 'jp' is Japan's country code, whereas 'ja' is the language code for Japanese.

junhewk commented 6 years ago

I changed the environment variable as you said. Now,

Sys.setenv(MECAB_LANG="ja")

will work.

I also tried installing the package from GitHub on my Windows 10 running under Parallels (I removed the compiler messages from the log below).

> Sys.setenv(MECAB_LANG="ja")
> devtools::install_github("junhewk/RcppMeCab")
Downloading GitHub repo junhewk/RcppMeCab@master
from URL https://api.github.com/repos/junhewk/RcppMeCab/zipball/master
Installing RcppMeCab
"C:/PROGRA~1/R/R-35~1.0/bin/x64/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD  \
  INSTALL "C:/Users/jk/AppData/Local/Temp/RtmpkXLar3/devtools1becc323f16/junhewk-RcppMeCab-d7b786b"  \
  --library="C:/Users/jk/Documents/R/win-library/3.5" --install-tests 

* installing *source* package 'RcppMeCab' ...
** libs

*** arch - i386
*** arch - x64
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
*** arch - i386
*** arch - x64
* DONE (RcppMeCab)
In R CMD INSTALL

Before compiling the C++ sources, the package installer downloads precompiled MeCab DLL files, which are attached to the repository's Releases. If MECAB_LANG is "ja", the installer downloads mecab32_ja.tar.gz and mecab64_ja.tar.gz into the i386 and x64 subfolders of the build directory. On Windows, the package can therefore be installed without MeCab itself being installed (although it will of course be needed when the user wants to run the functions).

It passes CRAN win-builder (and I didn't change anything in Makevars.win for this update). I'm sorry to ask, but could you describe your Windows installation environment?
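
As a rough sketch of the naming scheme just described (not the actual installer code, which lives in the package's build scripts), the archive URLs can be derived from the language code like this. The helper name mecab_archive_urls is hypothetical, and the release tag 0.0.1.0 is the one that shows up in the download error further down this thread; it may not be the tag the installer uses today.

```r
# Hypothetical helper: build the download URLs for both architectures.
mecab_archive_urls <- function(lang, tag = "0.0.1.0") {
  base <- "https://github.com/junhewk/RcppMeCab/releases/download"
  files <- sprintf("mecab%s_%s.tar.gz", c("32", "64"), lang)
  # i386 receives the 32-bit archive, x64 the 64-bit one.
  stats::setNames(paste(base, tag, files, sep = "/"), c("i386", "x64"))
}

urls <- mecab_archive_urls("ja")
print(urls)
```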

koheiw commented 6 years ago

No worries. I am happy to do more tests. The package compiles if I build the package from the source. I think these warnings explain the installation failure.

* removing 'C:/Users/Kohei/Documents/R/win-library/3.5/RcppMeCab'
* restoring previous 'C:/Users/Kohei/Documents/R/win-library/3.5/RcppMeCab'
Warning in file.copy(lp, dirname(pkgdir), recursive = TRUE, copy.date = TRUE) :
  problem copying C:\Users\Kohei\Documents\R\win-library\3.5\00LOCK-junhewk-RcppMeCab-510d53c\RcppMeCab\libs\x64\libmecab.dll to C:\Users\Kohei\Documents\R\win-library\3.5\RcppMeCab\libs\x64\libmecab.dll: Permission denied
Warning in file.copy(lp, dirname(pkgdir), recursive = TRUE, copy.date = TRUE) :
  problem copying C:\Users\Kohei\Documents\R\win-library\3.5\00LOCK-junhewk-RcppMeCab-510d53c\RcppMeCab\libs\x64\RcppMeCab.dll to C:\Users\Kohei\Documents\R\win-library\3.5\RcppMeCab\libs\x64\RcppMeCab.dll: Permission denied

I think these are the environment variables you need.

R_ARCH                              /x64
R_COMPILED_BY                       gcc 4.9.3
R_DOC_DIR                           C:/PROGRA~1/R/R-35~1.0/doc
R_HOME                              C:/PROGRA~1/R/R-35~1.0
R_LIBS_USER                         C:/Users/Kohei/Documents/R/win-library/3.5
R_USER                              C:/Users/Kohei/Documents
RCPP_PARALLEL_NUM_THREADS           2
READYAPPS                           C:\ProgramData\Lenovo\ReadyApps
RMARKDOWN_MATHJAX_PATH              C:/Program Files/RStudio/resources/mathjax-26
RS_LOCAL_PEER                       \\.\pipe\34851-rsession
RS_RPOSTBACK_PATH                   C:/Program Files/RStudio/bin/rpostback
RS_SHARED_SECRET                    63341846741
RSTUDIO                             1
RSTUDIO_CONSOLE_COLOR               256
RSTUDIO_CONSOLE_WIDTH               80
RSTUDIO_MSYS_SSH                    C:/Program Files/RStudio/bin/msys-ssh-1000-18
RSTUDIO_PANDOC                      C:/Program Files/RStudio/bin/pandoc
RSTUDIO_SESSION_PORT                34851
RSTUDIO_USER_IDENTITY               Kohei
RSTUDIO_WINUTILS                    C:/Program Files/RStudio/bin/winutils

junhewk commented 6 years ago

I searched for the problem you mentioned and found two Stack Overflow discussions, SO:1 and SO:2. According to them, it is caused by antivirus protection or by insufficient account permissions. I believe that downloading the source and using install_local() would solve the problem. Could you try that?

I'm working on returning the result as a data.frame, following your proposal. Thanks for your help!

koheiw commented 6 years ago

I could solve the issue simply by restarting the R session before installing (I have no third-party antivirus software on this machine). Installation now gets very close to the end, but there is one more error to tackle.

Error: package or namespace load failed for 'RcppMeCab' in inDL(x, as.logical(local), as.logical(now), ...):
 unable to load shared object 'C:/Users/Kohei/Documents/R/win-library/3.5/RcppMeCab/libs/x64/RcppMeCab.dll':
  LoadLibrary failure:  The specified module could not be found.

Error: loading failed
Execution halted
ERROR: loading failed for 'x64'
* removing 'C:/Users/Kohei/Documents/R/win-library/3.5/RcppMeCab'

I also noticed that the installer downloads mecab32 instead of mecab64. I wonder if this is related to the above error. Here I set a random string, xxx, to trigger the download error message.

Error in download.file(url = "https://github.com/junhewk/RcppMeCab/releases/download/0.0.1.0/mecab32_xxx.tar.gz",  : 
  cannot open URL 'https://github.com/junhewk/RcppMeCab/releases/download/0.0.1.0/mecab32_xxx.tar.gz'

junhewk commented 6 years ago

Thank you so much for your time.

Since I can't reproduce the error, I have to rely on searching; this GitHub comment could be helpful. What about MinGW? Rtools35 might work, I guess.

For the second issue, I changed Makevars.win to force the MECAB_LANG variable to be either ko or ja.

koheiw commented 6 years ago

Thanks. It works on one of the Windows machines!

> require(quanteda)
> require(RcppMeCab)
> require(quanteda.corpora)
> corp <- download("data_corpus_foreignaffairscommittee")
> txt <- tail(texts(corp), 1000)
> pos(txt[1], join = FALSE)
$...
        記号         名詞         記号         名詞         記号         名詞         記号         名詞         助詞         動詞 
         "○"       "宮本"         "("         "徹"         ")"       "委員"         " "       "内容"   "について"   "差し控え" 
        動詞         助詞         助詞       助動詞         助詞         記号         名詞         動詞       助動詞         助詞 
      "させ"         "て"       "じゃ"       "なく"         "て"         "、"       "確認"         "す"       "べき"       "じゃ" 
      助動詞         助詞         助詞         名詞         助詞         動詞         助詞         動詞         名詞       助動詞 
      "ない"         "か"     "という"       "こと"         "を"       "言っ"         "て"       "いる"       "わけ"       "です" 
        助詞         記号         記号         名詞         助詞       連体詞         名詞         助詞         動詞         助詞 
        "よ"         "。"         " "       "沖縄"         "の"       "あの"       "事故"         "を"       "受け"         "て" 
        記号         名詞         助詞         記号       連体詞         名詞         助詞         名詞         名詞         助詞 
        "、"     "皆さん"         "が"         "、"       "その"       "運用"         "の"       "安全"         "性"         "を" 
        名詞         動詞         助詞         動詞         記号         名詞         動詞         助詞         動詞         助詞 
      "確認"         "し"         "て"       "いる"         "、"       "確認"         "し"         "て"       "いる"     "という" 
        名詞         助詞         動詞         助詞         動詞         名詞       助動詞         助詞         記号         副詞 
      "こと"         "を"       "言っ"         "て"       "いる"       "わけ"       "です"   "けれども"         "、"       "実際" 
        動詞         助詞         動詞       助動詞         名詞         名詞         助詞         名詞         助詞         名詞 
        "出"         "て"         "き"         "た"         "米"         "軍"         "の" "マニュアル"     "という"         "の" 
        助詞         記号         名詞         名詞         名詞         助詞         名詞         名詞       助動詞         名詞 
        "は"         "、"       "空中"       "給油"       "訓練"         "で"       "破滅"         "的"         "な"       "影響" 
        助詞         名詞         助詞         動詞         動詞         名詞         助詞         動詞         名詞       助動詞 
        "が"       "結果"     "として"   "もたらさ"       "れる"       "危険"         "が"       "ある"         "ん"         "だ" 
        助詞         名詞         助詞         動詞         動詞         助詞         動詞       助動詞         名詞       助動詞 
    "という"         "の"         "が"       "書か"         "れ"         "て"         "い"         "た"       "わけ"       "です" 
        助詞         助詞         記号         名詞         助詞         記号         副詞         記号         名詞         助詞 
        "よ"         "ね"         "。"       "それ"         "を"         "、"       "なぜ"         "、"   "アメリカ"         "に" 
        名詞         名詞         助詞         名詞         助詞         動詞         助詞         副詞         動詞         助詞 
    "自衛隊"         "員"         "の"     "皆さん"         "が"       "行っ"         "て"       "実際"         "見"         "て" 
        動詞         助詞         助詞         動詞       助動詞         記号         名詞         助詞         動詞       助動詞 
      "いる"         "に"         "も"   "かかわら"         "ず"         "、"       "それ"         "を"     "つかも"         "う" 
        助詞         動詞       助動詞         名詞         助詞         記号         名詞         助詞         動詞       助動詞 
        "と"         "し"       "ない"         "の"         "か"         "、"       "そこ"         "が"     "わから"       "ない" 
      助動詞         助詞         記号         記号         動詞       助動詞       助動詞         名詞       助動詞         助詞 
      "です"         "よ"         "。"         " "       "知り"       "たく"       "ない"         "ん"       "です"         "か" 
        記号         名詞         名詞         助詞         名詞         助詞         動詞       助動詞       助動詞         助詞 
        "。"       "危険"         "性"         "に"         "目"         "を"       "向け"       "たく"       "ない"     "という" 
        名詞       助動詞         助詞         記号         副詞       助動詞         助詞         記号         名詞       接頭詞 
      "こと"       "です"         "か"         "。"       "どう"       "です"         "か"         "、"       "若宮"         "副" 
        名詞         記号 
      "大臣"         "。"

junhewk commented 6 years ago

Cool! So happy to hear that.

I would also like your advice on the format of the resulting data frame.

This is what I have in the temporary version:

> library(corpus)
> temp <- "○三ッ矢委員長 以上で説明は終わりました。\n 次回は、公報をもってお知らせすることとし、本日は、これにて散会いたします。\n    午後一時十一分散会\n"
> print.corpus_frame(as.data.frame(pos(c(txt1=temp), sys_dic="", user_dic="")))
   doc_id sentence_id token_id token    pos    subtype 
1  txt1             1        1 ○        記号   一般    
2  txt1             1        2 三ッ矢   名詞   固有名詞
3  txt1             1        3 委員     名詞   一般    
4  txt1             1        4 長       名詞   接尾    
5  txt1             1        5 以上     名詞   非自立  
6  txt1             1        6 で       助詞   格助詞  
7  txt1             1        7 説明     名詞   サ変接続
8  txt1             1        8 は       助詞   係助詞  
9  txt1             1        9 終わり   動詞   自立    
10 txt1             1       10 まし     助動詞         
11 txt1             1       11 た       助動詞         
12 txt1             1       12 。       記号   句点    
13 txt1             2        1 次回     名詞   副詞可能
14 txt1             2        2 は       助詞   係助詞  
15 txt1             2        3 、       記号   読点    
16 txt1             2        4 公報     名詞   一般    
17 txt1             2        5 を       助詞   格助詞  
18 txt1             2        6 もっ     動詞   自立    
19 txt1             2        7 て       助詞   接続助詞
20 txt1             2        8 お知らせ 名詞   サ変接続
...

(I used the corpus package to print UTF-8 characters in the data frame correctly on Windows.)

As you may know, MeCab returns several feature values for each morpheme: 品詞 (part of speech), 品詞細分類1 to 品詞細分類3 (part-of-speech subcategories 1 to 3), 活用型 (conjugation type), 活用形 (conjugation form), 原形 (base form), 読み (reading), and 発音 (pronunciation). I used 品詞 and 品詞細分類1 for the temporary output. (In the Korean version, these are the part-of-speech value and its subtype.) Is that adequate for analyzing Japanese? The problem is that the Korean and Japanese MeCab results differ, so I have to find a compromise.
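
To make the field list concrete, here is a small base-R sketch that splits one comma-separated feature string into named fields. The sample string is an illustrative IPA-dictionary-style entry written by hand, not output captured from RcppMeCab, and the English field names are assumed labels rather than anything the package defines.

```r
# English labels (assumed) for the nine feature fields.
fields <- c("pos", "subtype1", "subtype2", "subtype3",
            "conj_type", "conj_form", "base", "reading", "pronunciation")

# Hand-written sample in the same comma-separated layout; "*" marks an empty field.
feature <- "名詞,一般,*,*,*,*,委員,イイン,イイン"

parsed <- stats::setNames(strsplit(feature, ",", fixed = TRUE)[[1]], fields)

# The temporary data frame keeps only the first two fields.
print(parsed[c("pos", "subtype1")])
```

Keeping all nine fields in a named vector like this would make it easy to expose further columns (for example the base form) later without changing the parsing step.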

DrMaphuse commented 5 years ago

Hi! I believe I've managed to install your package, but I get an error when I try to run pos() with Japanese text:

> pos("これはぺんです", join = FALSE)
> Exception: 
> list()
> Error in print.function(args(obj)) : 
>   invalid multibyte string at '<ff><fe><61>ny<ff><fe>") 

Could this be a problem related to character encoding?

junhewk commented 5 years ago

Hi @DrMaphuse, I think it is an encoding problem, but I can't reproduce it in my environment. pos() itself does not call any print function, so the error probably arises in the R environment when the console tries to print the result.

Can you save the result with result <- pos("これはぺんです", join = FALSE)? I also recommend trying iconv("これはぺんです", from="SHIFT-JIS", to="UTF-8"). RcppMeCab takes a character vector directly from R (via the Rcpp vector type), processes the strings, and returns the result in UTF-8 encoding.
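
One way to rule out the input side is to confirm that the string really is UTF-8 before it is passed to pos(). A minimal base-R sketch, which does not call RcppMeCab at all; the string is written with \u escapes so that R marks it as UTF-8 regardless of the encoding of the script file or console.

```r
# "これはぺんです" spelled with Unicode escapes.
txt <- "\u3053\u308c\u306f\u307a\u3093\u3067\u3059"

# enc2utf8() converts from the declared encoding to UTF-8 (a no-op here).
txt_utf8 <- enc2utf8(txt)

print(Encoding(txt_utf8))   # declared encoding of the string
print(validUTF8(txt_utf8))  # whether the bytes form valid UTF-8
print(nchar(txt_utf8))      # number of characters, not bytes
```

If the declared encoding of a real input turns out to be something other than UTF-8, iconv() with the matching from= value is the conversion to try.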

DrMaphuse commented 5 years ago

Thanks for your input! I have tried your suggestion, but unfortunately, the output is a List of 0. Regarding the iconv(), is this necessary even if my MeCab and my R script files are already in UTF-8?

junhewk commented 5 years ago

That's a little strange. I assumed that your input was in SHIFT-JIS or some other multibyte Japanese encoding (as discussed in this Devtools Issue). If you are feeding UTF-8 into the function, I can't tell what the problem is.

@DrMaphuse , could you paste the result of sessionInfo() on your R console?

DrMaphuse commented 5 years ago

Sure!

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RcppMeCab_0.0.1.2

loaded via a namespace (and not attached):
[1] compiler_3.5.1     tools_3.5.1        Rcpp_0.12.17       RcppParallel_4.4.1

junhewk commented 5 years ago

@DrMaphuse , I'm so sorry that I can't reproduce your problem.

> Sys.setlocale("LC_ALL", "English_United Kingdom.1252")
 [1] "LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252" 
> library(RcppMeCab) 
> pos("これはぺんです", sys_dic="C:/PROGRA~2/MeCab/dic/ipadic") 
$`これはぺんです` 
[1] "これ/名詞"   "は/助詞"     "ぺん/名詞"   "です/助動詞"  
> pos("これはぺんです", join=FALSE, sys_dic="C:/PROGRA~2/MeCab/dic/ipadic")
$`これはぺんです`
  名詞   助詞   名詞 助動詞 
"これ"   "は" "ぺん" "です" 
> sessionInfo() 
R version 3.5.1 (2018-07-02) 
Platform: x86_64-w64-mingw32/x64 (64-bit) 
Running under: Windows >= 8 x64 (build 9200) 

Matrix products: default 

locale: 
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252

attached base packages: 
[1] stats     graphics  grDevices utils     datasets  methods   base       

other attached packages: 
[1] RcppMeCab_0.0.1.2  

loaded via a namespace (and not attached): 
[1] compiler_3.5.1     tools_3.5.1        yaml_2.1.19        Rcpp_0.12.17       RcppParallel_4.4.0

How about changing the R console's locale to English_United States.1252 via Sys.setlocale(category = "LC_ALL", locale = "English_United States.1252")? You could also try reinstalling MeCab and selecting UTF-8 as the encoding of the IPA dictionary.

DrMaphuse commented 5 years ago

Thank you for these suggestions - I have tried Sys.setlocale(category = "LC_ALL", locale = "English_United States.1252") and Sys.setlocale("LC_ALL", "ja"), but unfortunately the List output is still empty.

I have selected UTF-8 for the MeCab installation, so that should be correct already, but I might try to reinstall.

Is it possible that I installed your package wrong? I installed with devtools and install_github(), with latest version of R, RStudio and RTools.