ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

Depend on the stopwords package instead of providing that functionality #46

Closed · randomgambit closed this issue 7 years ago

randomgambit commented 7 years ago

Hello,

I was wondering if there is a dictionary of stop words available in tokenizers. I plan to use this package along with text2vec, but I don't see any way to remove stopwords...

Any ideas?

Thanks!

randomgambit commented 7 years ago

My bad, I just found the option.
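
For later readers, a minimal sketch of that option, assuming `tokenize_words()` as the entry point (the other `tokenize_*()` functions accept the same argument):

```r
library(tokenizers)

# The stopwords argument takes a character vector of terms to drop
# after tokenization (lowercasing happens first by default).
tokenize_words("The quick brown fox jumps over the lazy dog",
               stopwords = c("the", "over"))
#> [[1]]
#> [1] "quick" "brown" "fox" "jumps" "lazy" "dog"
```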

kbenoit commented 6 years ago

There is a package, stopwords, that contains the full language set of Snowball stopwords plus additional sets containing many others. It exposes a single function, e.g. stopwords(language = "en", source = "snowball"), and the source argument provides the ability to extend the sources. The return value is a simple character vector.
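
A quick illustration of that interface (the exact head of the Snowball list shown in the comments is from a current release of the package):

```r
library(stopwords)

# English stopwords from the Snowball source, as a plain character vector
sw <- stopwords(language = "en", source = "snowball")
head(sw)
#> [1] "i"      "me"     "my"     "myself" "we"     "our"

# The available sources can be listed as well
stopwords_getsources()
```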

We moved the native data objects out of quanteda when we realised that multiple packages were each creating duplicates of the same objects, and it would be more efficient to centralize them and provide guidelines for using them in other packages.

Might be something to consider including in tokenizers, and we would be happy to add custom word lists (e.g. Jockers) as an additional "source".

lmullen commented 6 years ago

@kbenoit Brilliant! Thanks for bringing the stopwords package to my attention. I have removed all the stopwords from tokenizers and taken a dependency on that package instead. I'm delighted to be out of the stopwords business.

If Jockers is willing, I will send a pull request with those stopwords to the stopwords package later.
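
A sketch of how the two packages fit together after this change, assuming tokenize_words() keeps accepting a character vector via its stopwords argument:

```r
library(tokenizers)
library(stopwords)

# Pass the stopwords package's vector straight into tokenizers
tokenize_words("The quick brown fox jumps over the lazy dog",
               stopwords = stopwords::stopwords("en", source = "snowball"))
```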