Closed rdatasculptor closed 7 years ago
Sorry for the earlier answer, I'm super jetlagged :) The hunspell_stem function performs stemming on individual words. Hence you need to tokenize the sentence first:
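library(hunspell)
# load your dictionary (the matching nl_NL.aff file must sit next to the .dic file)
nl_dict <- dictionary("nl_NL.dic")
mytext <- "Hij heeft de klok wel horen luiden, maar weet niet waar de klepel hangt"
# tokenize the sentence; hunspell_parse returns one character vector per input string
words <- hunspell_parse(mytext, dict = nl_dict)
# stem each token of the first (and only) sentence
hunspell_stem(words[[1]], dict = nl_dict)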
Thanks for your quick response!
This is the result of the script now:
[[1]]
[1] "hij"
[[2]]
[1] "heeft"
[[3]]
[1] "de"
[[4]]
[1] "klok"
[[5]]
[1] "wel"
[[6]]
[1] "horen" "hor"
[[7]]
[1] "luiden" "lui"
[[8]]
[1] "maar"
[[9]]
[1] "weet"
[[10]]
[1] "niet"
[[11]]
[1] "waar"
[[12]]
[1] "de"
[[13]]
[1] "klepel"
[[14]]
[1] "hangt"
I expected better stemming. This is not the result I wanted, but I guess it has something to do with the chosen dictionary.
Thanks again, and have a good recovery from your jetlag :-)
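The result above is a list with one character vector per word, and some words ("horen", "luiden") come back with two candidates. A minimal sketch of collapsing that to one stem per word follows. It assumes you want the last candidate in each vector, which may not be the right choice for every dictionary; `pick_last` is a hypothetical helper, not part of hunspell, and the input is a hand-copied slice of the output above rather than a live call.

```r
# Stemming output copied from the result above (only words 5 to 8 shown).
stems <- list(
  "wel",
  c("horen", "hor"),
  c("luiden", "lui"),
  "maar"
)

# Hypothetical helper: keep the last candidate per word,
# or NA if no stem was returned at all.
pick_last <- function(candidates) {
  if (length(candidates) == 0) NA_character_ else candidates[length(candidates)]
}

# vapply guarantees exactly one string per word.
one_stem <- vapply(stems, pick_last, character(1))
print(one_stem)
```

With real data you would pass the full list returned by hunspell_stem instead of the hand-copied `stems`.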
I agree that stemming doesn't seem to work very well for Dutch. It's not entirely useless, but it only recognizes very basic conjugations:
> hunspell_stem('winkels', dict = nl_dict)
[[1]]
[1] "winkel"
> hunspell_stem('koekjes', dict = nl_dict)
[[1]]
[1] "koek"
> hunspell_stem('fietsen', dict = nl_dict)
[[1]]
[1] "fietsen" "fiets"
> hunspell_stem('fietste', dict = nl_dict)
[[1]]
[1] "fiets"
> hunspell_stem('fietsenrek', dict = nl_dict)
[[1]]
[1] "fietsen"
> hunspell_stem('fietsenrekje', dict = nl_dict)
[[1]]
[1] "fietsenrek"
Spell checking generally seems OK though, at least for finding typos.
I followed the instructions in the package vignette (https://cran.r-project.org/web/packages/hunspell/vignettes/intro.html#custom_dictionaries) and tried to add a custom Dutch dictionary. I used this one: http://ftp.snt.utwente.nl/pub/software/openoffice/contrib/dictionaries/. I put the .dic and .aff files in the project working directory.
When I run this:
I get:
Does anyone have an idea about what's going wrong here?
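One thing worth ruling out with a setup like the one described above is that R simply cannot see the two files. The sketch below checks for them before loading the dictionary. The file names `nl_NL.dic` and `nl_NL.aff` are an assumption based on the name used elsewhere in this thread; the downloaded archive may use different names. The test words are only illustrative.

```r
# Assumed file names; adjust to whatever the downloaded archive contains.
dic_path <- file.path(getwd(), "nl_NL.dic")
aff_path <- file.path(getwd(), "nl_NL.aff")

if (file.exists(dic_path) && file.exists(aff_path)) {
  # Pass both paths explicitly so nothing depends on lookup rules.
  nl_dict <- hunspell::dictionary(lang = dic_path, affix = aff_path)
  # Quick sanity check: one correctly spelled word and one typo.
  print(hunspell::hunspell_check(c("fiets", "fietz"), dict = nl_dict))
} else {
  message("Dictionary files not found in ", getwd())
}
```

If the message branch fires, the problem is the file location or naming rather than hunspell itself.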