ropensci/tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

Inconsistent behavior of tokenize_tweets() when filtering stopwords with punctuation #76

Closed: syumet closed this issue 4 years ago

syumet commented 4 years ago

Consider this example:

> library(tokenizers)
> tokenize_words("i'm happy!", stopwords = c("i'm"), strip_punct = T)
[[1]]
[1] "happy"

> tokenize_tweets("i'm happy!", stopwords = c("i'm"), strip_punct = T)
[[1]]
[1] "im"    "happy"

From my observation, tokenize_tweets() removes punctuation before filtering stopwords, which is probably the cause of the problem.
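
If that ordering is indeed the cause, a possible workaround until a fix lands is to also supply the punctuation-stripped form of the stopword. This is only a sketch based on the behaviour shown above, not on the package internals:

# Workaround sketch: list both the original and the punctuation-stripped form
tokenize_tweets("i'm happy!", stopwords = c("i'm", "im"), strip_punct = TRUE)
# the stripped token "im" should then match the stopword, leaving only "happy"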

kbenoit commented 4 years ago

Fixed in a PR (still pending), thanks.

By the way, the quanteda package has a much-improved default tokenizer in v2 that handles social media tags better and faster than tokenize_tweets(), without the problems you noticed.
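
For reference, a minimal sketch of that approach with quanteda v2 (assuming quanteda is installed and keeps the contraction "i'm" as a single token; the exact output depends on its tokenizer rules):

library(quanteda)
# Tokenize while dropping punctuation-only tokens, then remove the stopword
toks <- tokens("i'm happy!", remove_punct = TRUE)
tokens_remove(toks, "i'm")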

lmullen commented 4 years ago

Thanks for the fix, @kbenoit.

@syumet: You should be able to install the development version with the fix via the remotes package.
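
For example (assuming the pending PR has been merged into the default branch):

# install.packages("remotes")
remotes::install_github("ropensci/tokenizers")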