Closed artemklevtsov closed 8 years ago
Actually I think libhunspell should pick up on certain environment variables to set the path, but I need to look into this a bit more.
It looks like the hunspell command line utility searches the following paths:
#define LIBDIR \
"/usr/share/hunspell:" \
"/usr/share/myspell:" \
"/usr/share/myspell/dicts:" \
"/Library/Spelling"
Homebrew also suggests ~/Library/Spelling/. The utility also checks the DICPATH environment variable, so we should probably do something similar.
Moreover, it checks WORDLIST for custom words (i.e. what the ignore parameter currently does).
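The lookup described above could be sketched in R roughly as follows. This is only an illustration using base R: the helper names dict_search_paths() and find_dict_file() are hypothetical and not part of the hunspell package, and putting the DICPATH entries ahead of the compiled-in defaults is an assumption.

```r
# Hypothetical helper: directories to search for dictionaries.
# DICPATH entries (if any) come first, followed by the defaults
# listed in the hunspell command line utility and by Homebrew.
dict_search_paths <- function() {
  dicpath <- Sys.getenv("DICPATH", unset = "")
  env_dirs <- if (nzchar(dicpath)) {
    strsplit(dicpath, .Platform$path.sep, fixed = TRUE)[[1]]
  } else {
    character()
  }
  defaults <- c(
    "/usr/share/hunspell",
    "/usr/share/myspell",
    "/usr/share/myspell/dicts",
    "/Library/Spelling",
    "~/Library/Spelling"
  )
  unique(path.expand(c(env_dirs, defaults)))
}

# Hypothetical helper: first existing <lang>.<ext> file on the search path.
find_dict_file <- function(lang, ext = "dic", dirs = dict_search_paths()) {
  candidates <- file.path(dirs, paste0(lang, ".", ext))
  hits <- candidates[file.exists(candidates)]
  if (length(hits) == 0) {
    stop("No ", ext, " file found for '", lang, "' in: ",
         paste(dirs, collapse = ", "))
  }
  hits[[1]]
}
```

With DICPATH pointing at a directory that contains ru_RU.dic, find_dict_file("ru_RU") would return that file before looking at the system directories.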
Also note: the RStudio Desktop package contains the hunspell English dicts:
$ ls -1 /usr/lib/rstudio/resources/dictionaries/ | head -n 3
en_AG.aff@
en_AG.dic@
en_AU.aff@
Ah cool, that's nice.
Can you test whether this works? I am mostly concerned with character encoding for non-Latin alphabets.
Now it works:
hunspell::hunspell_analyze("Текст", dict = "ru_RU")
#> [[1]]
#> [1] " st:текст"
#>
Do the hunspell_* functions not work with vectors?
My session info:
sessionInfo()
#> R version 3.2.3 (2015-12-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Arch Linux
#>
#> locale:
#> [1] LC_CTYPE=ru_RU.UTF-8 LC_NUMERIC=C LC_TIME=ru_RU.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=ru_RU.UTF-8 LC_MESSAGES=C LC_PAPER=ru_RU.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=ru_RU.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] tools_3.2.3 Rcpp_0.12.3 hunspell_1.1.9000
Yes it should work with vectors. What error do you get?
@jeroenooms sorry, my bad :) I tried "It workds fine" instead of c("It", "works", "fine").
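For reference, a whole sentence can be turned into the kind of word vector shown above with base R alone. This is a minimal sketch; splitting on whitespace is an assumption and ignores punctuation.

```r
# Split a sentence into a character vector with one element per word.
sentence <- "It works fine"
words <- strsplit(sentence, "[[:space:]]+")[[1]]
words
#> [1] "It"    "works" "fine"
```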
Additional dictionaries may be installed with RStudio Desktop (Global Options > Spelling). For details: https://support.rstudio.com/hc/en-us/articles/200551916?version=0.99.892&mode=desktop
I found some dicts in ~/.rstudio-desktop/dictionaries/languages-system.
Hmm, it doesn't work on Windows. I was afraid of that.
What do you think about a get_dict(lang) function which downloads the required language dictionary to ~/Library/Spelling?
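A rough sketch of what such a helper might look like, using only base R. Everything here is an assumption for illustration: the name download_dict() is hypothetical (chosen to avoid clashing with existing package internals), and base_url must be supplied by the caller, since no dictionary mirror is named in this thread. It assumes the server exposes <lang>.aff and <lang>.dic directly under base_url.

```r
# Hypothetical helper: fetch <lang>.aff and <lang>.dic from a
# caller-supplied base URL into a local dictionary directory.
download_dict <- function(lang, base_url, dest_dir = "~/Library/Spelling") {
  dest_dir <- path.expand(dest_dir)
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  for (ext in c("aff", "dic")) {
    file_name <- paste0(lang, ".", ext)
    status <- utils::download.file(
      url = paste0(base_url, "/", file_name),
      destfile = file.path(dest_dir, file_name),
      mode = "wb",   # binary mode, so the dictionary bytes are left untouched
      quiet = TRUE
    )
    if (status != 0) stop("Download failed for ", file_name)
  }
  invisible(dest_dir)
}
```

Downloading in binary mode matters here because the files may be in a legacy encoding such as KOI8-R and should not be altered in transit.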
I need to figure out the character encoding stuff first. This doesn't work on Windows right now when the character encoding of the dictionary doesn't match that of the string in R.
First I would try stringi::stri_enc_toutf8.
I don't want to depend on stringi. Maybe start with iconv.
What encoding is your dictionary in? Can you try: hunspell_info("ru_RU")?
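The iconv approach mentioned above might look roughly like this in base R. The helper names are hypothetical and this is not necessarily how the package handles it internally: the idea is just to convert input to the dictionary's encoding (as reported by hunspell_info) before the lookup, and convert results back to UTF-8 afterwards.

```r
# Hypothetical helpers: convert between R strings and the dictionary encoding.
to_dict_encoding <- function(words, dict_encoding) {
  utf8 <- enc2utf8(words)
  out <- iconv(utf8, from = "UTF-8", to = dict_encoding)
  # iconv() returns NA for strings it cannot represent in the target encoding
  if (anyNA(out[!is.na(words)])) {
    warning("Some words cannot be represented in ", dict_encoding)
  }
  out
}

from_dict_encoding <- function(words, dict_encoding) {
  iconv(words, from = dict_encoding, to = "UTF-8")
}

word <- "\u0442\u0435\u043a\u0441\u0442"           # "текст"
koi8 <- to_dict_encoding(word, "KOI8-R")           # one byte per letter
back <- from_dict_encoding(koi8, "KOI8-R")         # UTF-8 again
```

Whether a given encoding name such as "KOI8-R" is available depends on the platform's iconv implementation; iconvlist() shows what is supported.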
On Linux:
hunspell::hunspell_info(dict = "ru_RU")
#> $dict
#> [1] "/usr/share/hunspell/ru_RU.dic"
#>
#> $encoding
#> [1] "KOI8-R"
#>
#> $wordchars
#> [1] "-.'`ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xa3\xb3\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff"
#>
UPD:
Sys.setenv("DICPATH" = "~/.rstudio-desktop/dictionaries/languages-system")
hunspell::hunspell_info(dict = "de_DE")
#> $dict
#> [1] "/home/xxx/.rstudio-desktop/dictionaries/languages-system/de_DE.dic"
#>
#> $encoding
#> [1] "ISO8859-1"
#>
#> $wordchars
#> [1] "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe"
But we can convert the dict files to UTF-8 and put them in system.file("dict", package = "hunspell").
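Converting a dictionary file could be sketched like this with base R. The function name convert_dict_to_utf8() is hypothetical, and this is only a rough illustration: it re-encodes every line and rewrites the SET line of an .aff file, but a real dictionary may contain other encoding-dependent directives that would need checking by hand.

```r
# Hypothetical helper: re-encode a .aff or .dic file to UTF-8.
# `from` is the current encoding, e.g. "KOI8-R" or "ISO8859-1".
convert_dict_to_utf8 <- function(src, dest, from) {
  lines <- readLines(src, warn = FALSE)
  utf8 <- iconv(lines, from = from, to = "UTF-8")
  if (anyNA(utf8)) stop("Some lines could not be converted from ", from)
  # Keep the declared encoding in the .aff file in sync with the content
  utf8 <- sub("^SET[[:space:]]+.*$", "SET UTF-8", utf8)
  writeLines(utf8, dest, useBytes = TRUE)
  invisible(dest)
}
```

Both files of a dictionary (.aff and .dic) would need to be converted with the same `from` encoding.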
@artemklevtsov would you mind testing the latest version some more? In particular hunspell_info() and hunspell_find() with some Russian sentences?
> hunspell_info()
$dict
[1] "/usr/share/hunspell/en_US.dic"
$encoding
[1] "UTF-8"
$wordchars
[1] "0123456789"
> hunspell_info("ru_RU")
*** caught segfault ***
address (nil), cause 'memory not mapped'
Traceback:
1: .Call("hunspell_R_hunspell_info", PACKAGE = "hunspell", affix, dict)
2: R_hunspell_info(get_affix(dict), get_dict(dict))
3: hunspell_info("ru_RU")
Hmm that's very strange. I'm getting this on all my systems:
> hunspell_info()
$dict
[1] "/usr/lib/rstudio-server/resources/dictionaries/en_US.dic"
$encoding
[1] "ISO8859-1"
$wordchars
[1] "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ"
and
> hunspell_info("ru_RU")
$dict
[1] "/home/jeroen/R/x86_64-pc-linux-gnu-library/3.2/hunspell/dict/ru_RU.dic"
$encoding
[1] "KOI8-R"
$wordchars
[1] "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzёЁюабцдефгхийклмнопярстужвьызшэщчъЮАБЦДЕФГХИЙКЛМНОПЯРСТУЖВЬЫЗШЭЩЧЪ"
All my dicts in /usr/share/hunspell/ are in UTF-8. Russian dictionary source: http://extensions.libreoffice.org/extension-center/russian-spellcheck-dictionary.-based-on-works-of-aot-group/pscreleasefolder.2011-09-06.6209385965/0.4.0/dict_ru_ru-aot-0-4-0.oxt
Sorry for being an idiot, but how do I get the aff and dic files from that oxt?
Ah, never mind, I just renamed it to .tar.gz and it worked.
unzip also works.
These are called russian-aot and they are not utf8 (the .aff file contains SET KOI8-R) but it works for me:
> hunspell_info("russian-aot")
$dict
[1] "/Users/jeroen/Downloads/ru_utf8/russian-aot.dic"
$encoding
[1] "KOI8-R"
$wordchars
[1] "-.'`ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzёЁюабцдефгхийклмнопярстужвьызшэщчъЮАБЦДЕФГХИЙКЛМНОПЯРСТУЖВЬЫЗШЭЩЧЪ"
Can you maybe try again to see if it still happens?
Sorry about that. AOT is what I use on my desktop. The right link to reproduce: https://bitbucket.org/Shaman_Alex/russian-dictionary-hunspell/downloads/ru_RU_UTF-8_20131101.zip
OK thanks I understand the problem now.
Could you give it another try?
It seems to work now, but the characters field is missing:
hunspell::hunspell_info("ru_RU")
#> $dict
#> [1] "/usr/share/hunspell/ru_RU.dic"
#>
#> $encoding
#> [1] "KOI8-R"
#>
#> $wordchars
#> [1] "-.'`ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzёЁюабцдефгхийклмнопярстужвьызшэщчъЮАБЦДЕФГХИЙКЛМНОПЯРСТУЖВЬЫЗШЭЩЧЪ"
Sys.setenv("DICPATH" = "/tmp/")
hunspell::hunspell_info("ru_RU")
#> $dict
#> [1] "/tmp/ru_RU.dic"
#>
#> $encoding
#> [1] "UTF-8"
#>
#> $wordchars
#> [1] "NA"
Yes, apparently utf8 dictionaries do not have a wordchars field, or at least yours does not. Could you test some sentences with hunspell_find to see if it picks up incorrect words with either dictionary?
Works fine:
hunspell_find("чёртова карова", dict = "ru_RU")
#> [[1]]
#> [1] "карова"
#>
This is on CRAN now. Thanks for your suggestions. Feel free to open new issues if you run into other problems.
Thank you for this nice package.
Hi.
I have installed the hunspell package with many dictionaries. I can create a symlink or copy the dicts, but it would be better to control the search path with an option. For example: