hunspell_stem() doesn't return stems with custom dictionary Dutch

ropensci / hunspell

High-Performance Stemmer, Tokenizer, and Spell Checker for R

https://docs.ropensci.org/hunspell

hunspell_stem() doesn't return stems with custom dictionary Dutch #19

Closed rdatasculptor closed 7 years ago

rdatasculptor commented 7 years ago

I followed the instructions in the package vignette (https://cran.r-project.org/web/packages/hunspell/vignettes/intro.html#custom_dictionaries) and tried to add a custom Dutch dictionary. I used this one: http://ftp.snt.utwente.nl/pub/software/openoffice/contrib/dictionaries/. I put the .dic and .aff files in the project working directory.

When I run this:

hunspell_stem("Hij heeft de klok wel horen luiden, maar weet niet waar de klepel hangt", 
dict="nl_NL.dic")

I get:

[[1]]
character(0)

Does anyone have an idea about what's going wrong here?

jeroen commented 7 years ago

Sorry for the earlier answer, I'm super jetlagged :) The hunspell_stem function performs stemming on individual words. Hence you need to tokenize the sentence first:

library(hunspell)

# load your dictionary (the .dic and .aff files are in the working directory)
nl_dict <- dictionary("nl_NL.dic")
mytext <- "Hij heeft de klok wel horen luiden, maar weet niet waar de klepel hangt"

# tokenize and stem the words
words <- hunspell_parse(mytext, dict = nl_dict)
hunspell_stem(words[[1]], dict = nl_dict)

rdatasculptor commented 7 years ago

Thanks for your quick response!

This is the result of the script now:

[[1]]
[1] "hij"

[[2]]
[1] "heeft"

[[3]]
[1] "de"

[[4]]
[1] "klok"

[[5]]
[1] "wel"

[[6]]
[1] "horen" "hor"  

[[7]]
[1] "luiden" "lui"   

[[8]]
[1] "maar"

[[9]]
[1] "weet"

[[10]]
[1] "niet"

[[11]]
[1] "waar"

[[12]]
[1] "de"

[[13]]
[1] "klepel"

[[14]]
[1] "hangt"

I expected better stemming. This is not the result I wanted, but I guess it has something to do with the chosen dictionary.

Thanks again and have a good recovery from your jetlag :-)

jeroen commented 7 years ago

I agree that stemming doesn't seem to work very well for Dutch. It's not entirely useless, but it only recognizes very basic conjugations:

> hunspell_stem('winkels', dict = nl_dict)
[[1]]
[1] "winkel"

> hunspell_stem('koekjes', dict = nl_dict)
[[1]]
[1] "koek"

> hunspell_stem('fietsen', dict = nl_dict)
[[1]]
[1] "fietsen" "fiets"  

> hunspell_stem('fietste', dict = nl_dict)
[[1]]
[1] "fiets"

> hunspell_stem('fietsenrek', dict = nl_dict)
[[1]]
[1] "fietsen"

> hunspell_stem('fietsenrekje', dict = nl_dict)
[[1]]
[1] "fietsenrek"
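
The outputs in this thread show that hunspell_stem() returns a list holding zero, one, or several candidates per word. If you need exactly one stem per word, a small post-processing step can collapse the list. The following is only a sketch: pick_stem() is a hypothetical helper, not part of hunspell, and taking the last candidate is an assumption based on the "fietsen" "fiets" output above, not a documented rule. The candidate lists are hard-coded so the snippet runs without the dictionary files.

```r
# Candidate lists as hunspell_stem() printed them in this thread, plus one
# empty result standing in for a word the dictionary does not know.
words <- c("winkels", "fietsen", "xyzzy")
stems <- list("winkel", c("fietsen", "fiets"), character(0))

# Hypothetical helper (not part of hunspell): keep the last candidate,
# or fall back to the original word when there is no candidate at all.
pick_stem <- function(candidates, word) {
  if (length(candidates) == 0) return(word)
  candidates[length(candidates)]
}

picked <- vapply(
  seq_along(words),
  function(i) pick_stem(stems[[i]], words[i]),
  character(1)
)
print(picked)
```

With real data you would replace the hard-coded list with the result of hunspell_stem(words, dict = nl_dict). Whether the last candidate is the better stem depends on the dictionary, so check a sample of the results before relying on this.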

Spell checking generally seems OK though, at least for finding typos.
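
For reference, a minimal sketch of spell checking with the same dictionary. It assumes, as earlier in the thread, that nl_NL.dic and nl_NL.aff are in the working directory. The misspelling "klepell" is made up for illustration, and no output is shown because it depends on the dictionary version.

```r
library(hunspell)

# Assumes nl_NL.dic and nl_NL.aff sit in the working directory
nl_dict <- dictionary("nl_NL.dic")

# Check individual words: returns one TRUE/FALSE per word
ok <- hunspell_check(c("klepel", "klepell"), dict = nl_dict)
print(ok)

# Find misspelled words in running text (tokenizes internally)
mytext <- "Hij weet niet waar de klepell hangt"
bad <- hunspell_find(mytext, dict = nl_dict)
print(bad)

# Ask the dictionary for replacement candidates
print(hunspell_suggest(bad[[1]], dict = nl_dict))
```

Unlike hunspell_stem(), hunspell_find() accepts whole sentences, so no separate tokenizing step is needed here.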