Open mikesafar opened 6 years ago
Hi @mikesafar
The stemmer is intended to be used word by word, i.e., you can apply the split
function or a better, though more complex, tokenization process. For this purpose, you can take a look at
https://github.com/JuliaText/TextAnalysis.jl
or maybe,
https://github.com/sadit/TextModel.jl
The latter is mine, but it is focused on a particular kind of application.
Regards, Eric
I'm getting very odd results with the stemmer. Inconsistent. See below:
Main> SnowballStemmer.stem(SnowballStemmer.Stemmer("porter"), "Department of State")
"Department of St"
Main> SnowballStemmer.stem(SnowballStemmer.Stemmer("porter"), "State Department")
"State Depart"
Thanks. That's what I wound up doing in the end. But BTW: I'm still getting some odd results that I don't like with short words. I set my algorithm to stem only words 4 letters or longer.
That is an inherent problem of stemmers. Perhaps you need a lemmatizer.
Addendum: I've found that the best solution is:
matchall(r"\b\w+\b", text)
Split has too many exceptions, but the above does the trick, assuming that you can define a token with a regex, and you don't get crazy with it.
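For anyone landing here later, this is a rough sketch of the whole approach discussed in this thread: tokenize with the regex, then stem word by word, skipping words shorter than 4 letters. It uses `eachmatch` from Base, since `matchall` was removed in Julia 1.0. The names `tokenize`, `stem_tokens`, and `minlen` are made up for illustration, and the stemmer is passed in as a function so the sketch only depends on Base.

```julia
# Tokenize with the regex from this thread. Each match of \b\w+\b is one token.
tokenize(text::AbstractString) = [String(m.match) for m in eachmatch(r"\b\w+\b", text)]

# Apply `stemword` (any function String -> String) to each token that has at
# least `minlen` characters; shorter tokens are returned unchanged.
# `stemword` and `minlen` are hypothetical names, not part of any package.
function stem_tokens(stemword, text::AbstractString; minlen::Integer = 4)
    return [length(w) >= minlen ? stemword(w) : w for w in tokenize(text)]
end

# Stand-in for a real stemmer, used only so the sketch runs without packages.
toy_stem(w) = lowercase(w)

println(stem_tokens(toy_stem, "Department of State"))
```

With the package loaded, the function passed in would presumably be something like `w -> SnowballStemmer.stem(SnowballStemmer.Stemmer("porter"), w)`, following the call shown earlier in this thread. Note that `\w+` splits on apostrophes, so "don't" becomes two tokens.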