sadit / SnowballStemmer.jl

Julia's wrapper for libstemmer

Inconsistent and incorrect Results #2

Open mikesafar opened 6 years ago

mikesafar commented 6 years ago

I'm getting very odd, inconsistent results with the stemmer. See below:

Main> SnowballStemmer.stem(SnowballStemmer.Stemmer("porter"), "Department of State")
"Department of St"

Main> SnowballStemmer.stem(SnowballStemmer.Stemmer("porter"), "State Department")
"State Depart"

sadit commented 6 years ago

Hi @mikesafar

The stemmer is intended to be used word by word, i.e., you should first break the text into words, either with the split function or with a better but more complex tokenization process. For this purpose, you can take a look at

https://github.com/JuliaText/TextAnalysis.jl

or maybe,

https://github.com/sadit/TextModel.jl

The latter is mine, but it is focused on a particular kind of application.

Regards, Eric
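
As a minimal sketch of the word-by-word usage described above: the helper below splits a phrase on whitespace and passes each word to a stemming function. The name `stem_each` and its `stem_word` argument are hypothetical and used only for illustration; the commented lines reuse the `Stemmer` and `stem` calls exactly as they appear in this thread and assume the package is installed.

```julia
# Stem a phrase one word at a time.
# `stem_word` is any function that maps a single word to its stem (hypothetical parameter).
function stem_each(stem_word, text::AbstractString)
    # split(text) with no delimiter splits on runs of whitespace
    words = split(text)
    # Convert each SubString to a String before handing it to the stemmer
    return join((stem_word(String(w)) for w in words), " ")
end

# With the calls shown in this thread (assumes SnowballStemmer.jl is installed):
# using SnowballStemmer
# stemmer = SnowballStemmer.Stemmer("porter")
# stem_each(w -> SnowballStemmer.stem(stemmer, w), "Department of State")
```

Passing the stemmer in as a function keeps the helper independent of any particular package, so a different tokenizer or stemmer can be swapped in later.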

mikesafar commented 6 years ago

Thanks. That's what I wound up doing in the end. BTW, I'm still getting some odd results that I don't like with short words, so I set my algorithm to stem only words of 4 letters or more.
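
The length threshold mentioned here could look roughly like the sketch below. `stem_if_long`, `stem_word`, and `minlen` are hypothetical names; `stem_word` stands in for whatever single-word stemmer call is actually used.

```julia
# Stem a word only if it has at least `minlen` characters; shorter words pass through unchanged.
# `stem_word` is a placeholder for the real stemmer call (hypothetical parameter).
function stem_if_long(stem_word, word::AbstractString; minlen::Integer = 4)
    # length counts characters, not bytes, so non-ASCII words are measured correctly
    return length(word) >= minlen ? stem_word(word) : word
end
```

The default of `minlen = 4` mirrors the cutoff described in the comment; whether that value is a good choice depends on the language and the data.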

sadit commented 6 years ago

That is an inherent problem of stemmers. Perhaps you need a lemmatizer.

mikesafar commented 6 years ago

Addendum: I've found that the best solution is:

matchall(r"\b\w+\b", text)

Split has too many exceptions, but the above does the trick, assuming that you can define a token with a regex, and you don't get crazy with it.
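
For Julia 1.0 and later, the same regex tokenization can be sketched with `eachmatch` from Base. The function name `tokenize` is hypothetical; the pattern is the one from the comment above.

```julia
# Collect every run of word characters as a token.
# Each m.match is a SubString of the original text.
tokenize(text::AbstractString) = [m.match for m in eachmatch(r"\b\w+\b", text)]

tokenize("State Department, U.S.")   # "State", "Department", "U", "S"
split("State Department, U.S.")      # "State", "Department,", "U.S."
```

The comparison shows the trade-off: whitespace `split` leaves punctuation attached to words, while the `\w+` pattern drops punctuation but also breaks abbreviations such as "U.S." into separate tokens.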