Closed: randomgambit closed this issue 7 years ago
My bad, I just found the option.
There is a package, stopwords, that contains the full language set of Snowball stopwords, plus additional sets containing many others. It exposes a single function, e.g. `stopwords(language = "en", source = "snowball")`, and the sources can be extended via the `source` argument. The return value is a simple character vector.
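A quick sketch of the basic usage (assuming the stopwords package is installed from CRAN):

```r
# install.packages("stopwords")  # if not already installed
library(stopwords)

# Snowball English stopwords, returned as a plain character vector
sw <- stopwords(language = "en", source = "snowball")
head(sw)
length(sw)
```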
We moved the native data objects out of quanteda when we realised that multiple packages were each creating duplicates of the same objects, and it would be more efficient to centralize them and provide guidelines for using them in other packages.
Might be something to consider including in tokenizers, and we would be happy to add custom word lists (e.g. Jockers) as an additional "source".
@kbenoit Brilliant! Thanks for bringing the stopwords package to my attention. I have removed all the stopwords from tokenizers and taken a dependency on that package instead. I'm delighted to be out of the stopwords business.
If Jockers is willing, I will send a pull request with those stopwords to the stopwords package later.
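For anyone landing here later, something like this should now work: a minimal sketch, assuming the `stopwords` argument of `tokenize_words()` (which takes a character vector of words to drop) and the stopwords package as the source of that vector.

```r
library(tokenizers)
library(stopwords)

text <- "The quick brown fox jumps over the lazy dog"

# Tokenize and drop Snowball English stopwords in one step
tokenize_words(text, stopwords = stopwords::stopwords("en"))
```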
Hello,
I was wondering if there is a dictionary of stop-words available in tokenizers. I plan to use this package along with text2vec, but I don't see any way to remove stopwords... Any ideas?
Thanks!