jekyll / classifier-reborn

A general classifier module to allow Bayesian and other types of classifications. A fork of cardmagic/classifier.
https://jekyll.github.io/classifier-reborn/
GNU Lesser General Public License v2.1

Custom stopwords file during classifier initialization #125

Closed ibnesayeed closed 7 years ago

ibnesayeed commented 7 years ago

Currently, the library ships with a list of stopwords in various languages. PR #73 adds the ability to specify more directories to look for stopwords. This means one can only add more stopwords but cannot override the existing ones, except perhaps by setting the value of Hasher.STOPWORDS. However, a single stopwords list does not suit all situations: for special-purpose collections, the stopwords can differ even within the same language, and in some cases stopwords are not desired at all.

The current implementation also strongly relies on the name of the file being the language code.

In reality, one classifier instance is tied to only one language, and each classifier may want to use its own stopwords. It would be nice to be able to pass an array of stopwords or an arbitrary file path during the initialization of the classifier that can overwrite the value of Hasher.STOPWORDS[@language]. I should be able to make a PR for this if we decide to go for it.
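
To make the proposal concrete, here is a minimal, self-contained sketch of how such an option might behave. It is not the library's actual API: `SketchClassifier`, `DEFAULT_STOPWORDS`, and the `stopwords:` keyword are hypothetical names used only for illustration, and the file format (whitespace-separated words) is an assumption.

```ruby
require 'set'

# Hypothetical stand-in for the library's per-language default lists.
DEFAULT_STOPWORDS = {
  'en' => Set.new(%w[a an and the of to is])
}.freeze

# Hypothetical classifier-like class; only the stopword handling is sketched.
class SketchClassifier
  attr_reader :stopwords

  # stopwords: nil         -> fall back to the defaults for `language`
  # stopwords: Array / Set -> use exactly these words ([] disables filtering)
  # stopwords: String      -> path to a file of whitespace-separated words
  def initialize(language: 'en', stopwords: nil)
    @stopwords =
      case stopwords
      when nil    then DEFAULT_STOPWORDS.fetch(language, Set.new)
      when String then Set.new(File.read(stopwords).split)
      else             Set.new(stopwords)
      end
  end

  # Lowercase the text, split it into word tokens, and drop stopwords.
  def tokens(text)
    text.downcase.scan(/\w+/).reject { |word| @stopwords.include?(word) }
  end
end

default_clf = SketchClassifier.new
custom_clf  = SketchClassifier.new(stopwords: %w[state])
no_filter   = SketchClassifier.new(stopwords: [])

p default_clf.tokens('The state of the art')
p custom_clf.tokens('The state of the art')
p no_filter.tokens('The state of the art')
```

In this sketch the custom list replaces the defaults rather than extending them, and an empty array turns stopword filtering off entirely, which covers the two cases described above. The list is also held per instance, so two classifiers for the same language can use different stopwords without touching a shared constant.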

Ch4s3 commented 7 years ago

I could probably make that happen! I'll take a look tomorrow!

ibnesayeed commented 7 years ago

PR #129 should take care of it. However, we need code review and some test cases before we merge it.