Currently, the library ships with lists of stopwords in various languages. PR #73 adds the ability to specify more directories to search for stopwords. This means one can only add more stopwords but can't overwrite the existing ones, except perhaps by setting the value of `Hasher.STOPWORDS`. However, a single stopword list does not suit every situation: special-purpose collections can call for different stopwords in the same language, and in some cases stopwords are not desired at all.
The current implementation also strongly relies on the name of the file being the language code.
In practice, a classifier instance is tied to only one language, and each classifier may want to use its own stopwords. It would be nice to be able to pass an array of stopwords or an arbitrary file path when initializing the classifier, overriding the value of `Hasher.STOPWORDS[@language]`. I should be able to make a PR for this if we decide to go for it.
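
To make the proposal concrete, here is a rough sketch of how such an initializer option could behave. It does not use the library's actual code: `SketchHasher`, `SketchClassifier`, `load_stopwords` and the `stopwords:` keyword are all hypothetical names chosen for illustration, and only the `STOPWORDS` constant name comes from the existing implementation.

```ruby
require 'set'

# Hypothetical stand-in for the library's Hasher; only the STOPWORDS
# constant name is taken from the discussion above.
module SketchHasher
  # Per-language stopword sets, keyed by language code.
  STOPWORDS = Hash.new { |hash, language| hash[language] = Set.new }

  module_function

  # Accepts an Array of words or a path to a file of
  # whitespace-separated words, and returns a normalized Set.
  def load_stopwords(source)
    words = source.is_a?(Array) ? source : File.read(source).split
    Set.new(words.map { |word| word.to_s.strip.downcase }.reject(&:empty?))
  end
end

# Hypothetical classifier showing only the proposed initializer option.
class SketchClassifier
  attr_reader :language

  # stopwords: nil keeps whatever is already registered for the language;
  # an Array or a file path replaces it; an empty Array disables stopwords.
  def initialize(language: 'en', stopwords: nil)
    @language = language
    return if stopwords.nil?

    SketchHasher::STOPWORDS[@language] = SketchHasher.load_stopwords(stopwords)
  end

  # The stopword set currently registered for this classifier's language.
  def stopwords
    SketchHasher::STOPWORDS[@language]
  end
end
```

With this shape, `SketchClassifier.new(language: 'en', stopwords: [])` would turn stopwords off and `stopwords: 'path/to/list.txt'` would load a custom list regardless of the file name. Because the sketch overwrites the shared per-language entry, two classifiers using the same language would still share one list; keeping the list on the instance instead would be an alternative worth discussing.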