ropensci / hunspell

High-Performance Stemmer, Tokenizer, and Spell Checker for R
https://docs.ropensci.org/hunspell

Use system dictionaries #3

Closed artemklevtsov closed 8 years ago

artemklevtsov commented 8 years ago

Hi.

I have the system hunspell package installed with many dictionaries. I could create a symlink or copy the dictionaries, but it would be better to control the search path with an option. For example:

options(hunspell.path = "/usr/share/hunspell/")

# The option names a directory, so append the file name before resolving it.
get_affix <- function(lang) {
  file <- paste0(lang, ".aff")
  default <- system.file("dict", lang, file, package = "hunspell")
  dir <- getOption("hunspell.path")
  path <- if (is.null(dir)) default else file.path(dir, file)
  normalizePath(path, mustWork = TRUE)
}

get_dict <- function(lang) {
  file <- paste0(lang, ".dic")
  default <- system.file("dict", lang, file, package = "hunspell")
  dir <- getOption("hunspell.path")
  path <- if (is.null(dir)) default else file.path(dir, file)
  normalizePath(path, mustWork = TRUE)
}

jeroen commented 8 years ago

Actually I think libhunspell should pick up on certain environment variables to set the path, but I need to look into this a bit more.

jeroen commented 8 years ago

It looks like the hunspell command line utility searches the following paths:

#define LIBDIR                \
  "/usr/share/hunspell:"      \
  "/usr/share/myspell:"       \
  "/usr/share/myspell/dicts:" \
  "/Library/Spelling"

Homebrew also suggests ~/Library/Spelling/. The utility also checks for the DICPATH environment variable. So we should probably do something similar.

It also checks the WORDLIST environment variable for custom words (which corresponds to our current ignore parameter).
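The lookup described above could be sketched in R roughly as follows. This is only an illustration, not the package's implementation: `find_dictionary` is a hypothetical helper, and it assumes that DICPATH entries take precedence over the built-in directories and are separated by the platform's path separator.

```r
# Hypothetical helper: locate <lang>.dic on a hunspell-style search path.
find_dictionary <- function(lang, ext = ".dic") {
  # Assumption: DICPATH may hold several directories, split on the path separator.
  dicpath <- Sys.getenv("DICPATH", unset = "")
  user_dirs <- if (nzchar(dicpath)) {
    strsplit(dicpath, .Platform$path.sep, fixed = TRUE)[[1]]
  } else {
    character()
  }
  # Directories from the LIBDIR definition quoted above, plus the Homebrew suggestion.
  default_dirs <- c(
    "/usr/share/hunspell",
    "/usr/share/myspell",
    "/usr/share/myspell/dicts",
    "/Library/Spelling",
    path.expand("~/Library/Spelling")
  )
  candidates <- file.path(c(user_dirs, default_dirs), paste0(lang, ext))
  hits <- candidates[file.exists(candidates)]
  if (length(hits) == 0) {
    stop("Dictionary not found for language: ", lang)
  }
  # Return the first match so that earlier directories win.
  normalizePath(hits[1])
}
```

Calling `find_dictionary("ru_RU", ".aff")` would return the matching affix file from the same search path.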

jeroen commented 8 years ago

First attempt: https://github.com/jeroenooms/hunspell/commit/dabcd65fc371406da4ae19953211d60c261fa27e

artemklevtsov commented 8 years ago

Also note that the RStudio Desktop package ships the hunspell English dictionaries:

$ ls -1 /usr/lib/rstudio/resources/dictionaries/ | head -n 3
en_AG.aff@
en_AG.dic@
en_AU.aff@

jeroen commented 8 years ago

Ah cool, that's nice.

jeroen commented 8 years ago

Can you test if this works? I am mostly concerned with the character encoding stuff on non-latin alphabets....

artemklevtsov commented 8 years ago

Now it works:

hunspell::hunspell_analyze("Текст", dict = "ru_RU")
#> [[1]]
#> [1] " st:текст"
#> 

Do the hunspell_* functions not work with vectors?

My session info:

sessionInfo()
#> R version 3.2.3 (2015-12-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Arch Linux
#> 
#> locale:
#>  [1] LC_CTYPE=ru_RU.UTF-8       LC_NUMERIC=C               LC_TIME=ru_RU.UTF-8        LC_COLLATE=C              
#>  [5] LC_MONETARY=ru_RU.UTF-8    LC_MESSAGES=C              LC_PAPER=ru_RU.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=ru_RU.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#> [1] tools_3.2.3       Rcpp_0.12.3       hunspell_1.1.9000

jeroen commented 8 years ago

Yes it should work with vectors. What error do you get?

artemklevtsov commented 8 years ago

@jeroenooms sorry, my bad :) I tried "It workds fine" instead of c("It", "works", "fine").
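A minimal sketch of the difference, assuming the word-level functions expect one word per element: split the sentence into words before passing it on. The call into hunspell is guarded so the snippet also runs where the package is not installed.

```r
sentence <- "It works fine"
# Split on runs of whitespace to obtain one element per word.
words <- strsplit(sentence, "\\s+")[[1]]

# Only call into hunspell when the package is available.
if (requireNamespace("hunspell", quietly = TRUE)) {
  print(hunspell::hunspell_analyze(words))
}
```

Sentence-level input is a better fit for hunspell_find, which is used on a whole sentence later in this thread.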

Additional dictionaries can be installed with RStudio Desktop (Global Options > Spelling). For details: https://support.rstudio.com/hc/en-us/articles/200551916?version=0.99.892&mode=desktop

I found some dictionaries in ~/.rstudio-desktop/dictionaries/languages-system.

jeroen commented 8 years ago

Hmm, it doesn't work on Windows. I was afraid of that.

artemklevtsov commented 8 years ago

What do you think about a get_dict(lang) function that downloads the required language dictionary to ~/Library/Spelling?

jeroen commented 8 years ago

I need to figure out the character encoding stuff first. This doesn't work on Windows right now when the character encoding of the dictionary doesn't match that of the string in R.

artemklevtsov commented 8 years ago

> I need to figure out the character encoding stuff first. This doesn't work on Windows right now when the character encoding of the dictionary doesn't match that of the string in R.

First I would try stringi::stri_enc_toutf8.

jeroen commented 8 years ago

I don't want to depend on stringi. Maybe start with iconv.
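As an illustration of the iconv route, here is a minimal sketch in base R that converts a UTF-8 string to a dictionary's encoding and back. The helper names `to_dict_encoding` and `from_dict_encoding` are hypothetical, and the sketch assumes the platform's iconv supports the target encoding (KOI8-R here).

```r
# Hypothetical helpers converting between R's UTF-8 strings and a dictionary's encoding.
to_dict_encoding <- function(x, encoding) {
  # enc2utf8() normalises the input so that `from` is known to be UTF-8.
  iconv(enc2utf8(x), from = "UTF-8", to = encoding)
}
from_dict_encoding <- function(x, encoding) {
  iconv(x, from = encoding, to = "UTF-8")
}

word <- "\u0442\u0435\u043a\u0441\u0442"  # the Russian word used earlier, written with escapes
koi <- to_dict_encoding(word, "KOI8-R")
length(charToRaw(koi))                    # KOI8-R uses one byte per letter
back <- from_dict_encoding(koi, "KOI8-R")
identical(back, word)                     # round trip check
```

Note that iconv returns NA for input it cannot convert, so real code would need to check for that before handing the string to the spell checker.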

jeroen commented 8 years ago

What encoding is your dictionary in? Can you try: hunspell_info("ru_RU") ?

artemklevtsov commented 8 years ago

On Linux:

hunspell::hunspell_info(dict = "ru_RU")
#> $dict
#> [1] "/usr/share/hunspell/ru_RU.dic"
#> 
#> $encoding
#> [1] "KOI8-R"
#> 
#> $wordchars
#> [1] "-.'`ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xa3\xb3\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"
#> 

UPD:

Sys.setenv("DICPATH" = "~/.rstudio-desktop/dictionaries/languages-system")
hunspell::hunspell_info(dict = "de_DE")
#> $dict
#> [1] "/home/xxx/.rstudio-desktop/dictionaries/languages-system/de_DE.dic"
#> 
#> $encoding
#> [1] "ISO8859-1"
#> 
#> $wordchars
#> [1] "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe" 

But we could convert the dictionary files to UTF-8 and put them in system.file("dict", package = "hunspell").
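A rough sketch of such a conversion in base R, under the assumption that re-encoding the text and rewriting the SET line of the .aff file is enough; real dictionaries may need more care. `convert_to_utf8` is a hypothetical helper, not part of the package.

```r
# Hypothetical helper: copy a dictionary file, re-encoding its text to UTF-8.
convert_to_utf8 <- function(src, dest, from) {
  # Read the lines as-is; iconv() then interprets the bytes as `from`.
  lines <- readLines(src, warn = FALSE)
  converted <- iconv(lines, from = from, to = "UTF-8")
  if (anyNA(converted)) {
    stop("Some lines could not be converted from ", from)
  }
  # An affix file declares its encoding in a SET line, which must match the new bytes.
  converted <- sub("^SET[ \t]+.*$", "SET UTF-8", converted)
  # useBytes = TRUE writes the UTF-8 bytes without re-encoding to the locale.
  writeLines(converted, dest, useBytes = TRUE)
  invisible(dest)
}
```

The same function would be applied to both the .aff and the .dic file, passing the encoding reported by hunspell_info() as `from`.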

jeroen commented 8 years ago

@artemklevtsov would you mind testing the latest version some more? In particular hunspell_info() and hunspell_find() with some Russian sentences?

artemklevtsov commented 8 years ago

> hunspell_info()
$dict
[1] "/usr/share/hunspell/en_US.dic"

$encoding
[1] "UTF-8"

$wordchars
[1] "0123456789"

> hunspell_info("ru_RU")

 *** caught segfault ***
address (nil), cause 'memory not mapped'

Traceback:
 1: .Call("hunspell_R_hunspell_info", PACKAGE = "hunspell", affix,     dict)
 2: R_hunspell_info(get_affix(dict), get_dict(dict))
 3: hunspell_info("ru_RU")

jeroen commented 8 years ago

Hmm that's very strange. I'm getting this on all my systems:

> hunspell_info()
$dict
[1] "/usr/lib/rstudio-server/resources/dictionaries/en_US.dic"

$encoding
[1] "ISO8859-1"

$wordchars
[1] "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ"

and

> hunspell_info("ru_RU")
$dict
[1] "/home/jeroen/R/x86_64-pc-linux-gnu-library/3.2/hunspell/dict/ru_RU.dic"

$encoding
[1] "KOI8-R"

$wordchars
[1] "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzёЁюабцдефгхийклмнопярстужвьызшэщчъЮАБЦДЕФГХИЙКЛМНОПЯРСТУЖВЬЫЗШЭЩЧЪ"

artemklevtsov commented 8 years ago

All my dictionaries in /usr/share/hunspell/ are in UTF-8. Russian dictionary source: http://extensions.libreoffice.org/extension-center/russian-spellcheck-dictionary.-based-on-works-of-aot-group/pscreleasefolder.2011-09-06.6209385965/0.4.0/dict_ru_ru-aot-0-4-0.oxt

jeroen commented 8 years ago

Sorry for being an idiot but how do I get the aff and dic file from that oxt?

jeroen commented 8 years ago

Ah, never mind, I just renamed it to .tar.gz and it worked.

artemklevtsov commented 8 years ago

unzip also works.

jeroen commented 8 years ago

These are called russian-aot and they are not UTF-8 (the .aff file contains SET KOI8-R), but it works for me:


> hunspell_info("russian-aot")
$dict
[1] "/Users/jeroen/Downloads/ru_utf8/russian-aot.dic"

$encoding
[1] "KOI8-R"

$wordchars
[1] "-.'`ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzёЁюабцдефгхийклмнопярстужвьызшэщчъЮАБЦДЕФГХИЙКЛМНОПЯРСТУЖВЬЫЗШЭЩЧЪ"

jeroen commented 8 years ago

Can you try again to see if it still happens?

artemklevtsov commented 8 years ago

Sorry about that. The AOT dictionary is the one I use on my desktop. The correct link to reproduce the problem: https://bitbucket.org/Shaman_Alex/russian-dictionary-hunspell/downloads/ru_RU_UTF-8_20131101.zip

jeroen commented 8 years ago

OK thanks I understand the problem now.

jeroen commented 8 years ago

Could you give it another try?

artemklevtsov commented 8 years ago

It seems to work now, but the wordchars field is missing:

hunspell::hunspell_info("ru_RU")
#> $dict
#> [1] "/usr/share/hunspell/ru_RU.dic"
#> 
#> $encoding
#> [1] "KOI8-R"
#> 
#> $wordchars
#> [1] "-.'`ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzёЁюабцдефгхийклмнопярстужвьызшэщчъЮАБЦДЕФГХИЙКЛМНОПЯРСТУЖВЬЫЗШЭЩЧЪ"
Sys.setenv("DICPATH" = "/tmp/")
hunspell::hunspell_info("ru_RU")
#> $dict
#> [1] "/tmp/ru_RU.dic"
#> 
#> $encoding
#> [1] "UTF-8"
#> 
#> $wordchars
#> [1] "NA"

jeroen commented 8 years ago

Yes, apparently UTF-8 dictionaries do not have a wordchars field, or at least yours does not. Could you test some sentences with hunspell_find to see if it picks up incorrect words with either dictionary?
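One way to check which fields a dictionary declares is to read the directives straight from its .aff file. A minimal sketch, assuming the standard SET and WORDCHARS affix directives each sit on their own line; `read_aff_directive` is a hypothetical helper, not part of the hunspell package.

```r
# Hypothetical helper: return the value of a directive (e.g. "SET") from an .aff file.
read_aff_directive <- function(aff_file, directive) {
  lines <- readLines(aff_file, warn = FALSE)
  # Match on raw bytes so affix files in other encodings do not trip the regex engine.
  pattern <- paste0("^", directive, "[ \t]+")
  hit <- grep(pattern, lines, useBytes = TRUE, value = TRUE)
  if (length(hit) == 0) {
    return(NA_character_)  # the directive is not declared in this file
  }
  value <- sub(pattern, "", hit[1], useBytes = TRUE)
  # Drop trailing whitespace, including a carriage return from Windows line endings.
  sub("[ \t\r]+$", "", value, useBytes = TRUE)
}
```

For a file without a WORDCHARS line, `read_aff_directive(aff_file, "WORDCHARS")` returns NA, which would be consistent with the "NA" shown in the output above.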

artemklevtsov commented 8 years ago

Works fine:

hunspell_find("чёртова карова", dict = "ru_RU")
#> [[1]]
#> [1] "карова"
#>

jeroen commented 8 years ago

This is on CRAN now. Thanks for your suggestions. Feel free to open new issues if you run into other problems.

artemklevtsov commented 8 years ago

Thank you for this nice package.