ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

tokenize_tweets and single word strings #70

Closed · juliasilge closed this issue 6 years ago

juliasilge commented 6 years ago

Hello! In juliasilge/tidytext#119, we have found what looks like an error in tokenize_tweets() when the input character vector is a single word/token.

library(tokenizers)

one_word_string <- "Word!"

tokenize_words(one_word_string)
#> [[1]]
#> [1] "word"
tokenize_sentences(one_word_string)
#> [[1]]
#> [1] "Word!"
tokenize_tweets(one_word_string)
#> Error in cut.default(seq_along(out), docindex, include.lowest = TRUE, : 'breaks' are not unique

Created on 2018-07-25 by the reprex package (v0.2.0).

Looks like it is happening at the cut() call when creating the output.
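
For context, base R's cut() rejects duplicated break points, so a document index with a repeated break reproduces the same error. The values below are a hypothetical illustration of the failure mode, not the package's actual internals:

out <- "word"                       # a single token from a one-word document
docindex <- c(0, 1, 1)              # hypothetical duplicated break points
cut(seq_along(out), docindex, include.lowest = TRUE)
#> Error in cut.default(seq_along(out), docindex, include.lowest = TRUE, : 'breaks' are not unique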

lmullen commented 6 years ago

This should be fixed on the master branch. I would appreciate it if you would test this in whatever circumstances are causing the error in tidytext, @juliasilge.
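
One way to test against the development version (assuming the remotes package is available; this install step is a generic suggestion, not taken from the thread):

remotes::install_github("ropensci/tokenizers")
library(tokenizers)
tokenize_tweets("Word!")  # the call that previously errored; should now return a one-element list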

juliasilge commented 6 years ago

It is all fixed now and behaving correctly. Thanks so much, @lmullen! 🙌