ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

Inconsistent tokenizing when numbers are followed by a period #52

Closed by rajkorde 7 years ago

rajkorde commented 7 years ago

Tokenization is inconsistent when numbers are followed by a period. In the example below, we23.exe is split at the period into "we23" and "exe", but icecream.exe is kept as a single token.

d <- c("I like we23.exe", "I like icecream.exe")
tokenize_words(d)
[[1]]
[1] "i"    "like" "we23" "exe"

[[2]]
[1] "i"            "like"         "icecream.exe"

lmullen commented 7 years ago

Thanks for the bug report, @rajkorde. However, these kinds of decisions about word boundaries aren't made directly by tokenizers. They are made by the stringi package, which in turn relies on the ICU library. If you think that these are in error, I'd encourage you to file a bug report in one of those two places.
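To check that the split comes from the boundary analysis rather than from tokenizers itself, one can call stringi directly. The snippet below is a minimal sketch, not necessarily the exact call tokenizers makes internally. It assumes stringi is installed, and the result may vary with the ICU version stringi was built against, so no output is shown. A likely explanation, based on the Unicode word segmentation rules that ICU implements (UAX #29), is that a period between two letters does not create a boundary, while a period between a digit and a letter does.

```r
library(stringi)

d <- c("I like we23.exe", "I like icecream.exe")

# Lowercase first, then split at ICU word boundaries.
# skip_word_none = TRUE drops segments with no word characters
# (spaces, punctuation), leaving only word-like tokens.
tokens <- stri_split_boundaries(
  stri_trans_tolower(d),
  type = "word",
  skip_word_none = TRUE
)

print(tokens)
```

If the same split appears here without tokenizers loaded, the report belongs with stringi or ICU.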

A development version of tokenizers should soon support tokenization by white space, which is not as smart but would be more consistent in your case.
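Until that is available, a rough equivalent can be written in base R. This is only a sketch of the idea: `tokenize_on_whitespace` is a hypothetical helper name, not a function from tokenizers, and it splits on whitespace only, so punctuation stays attached to the tokens.

```r
d <- c("I like we23.exe", "I like icecream.exe")

# Hypothetical helper (not part of tokenizers): lowercase, trim,
# and split on runs of whitespace only. Periods are never treated
# as boundaries, so both file names are handled the same way.
tokenize_on_whitespace <- function(x) {
  strsplit(trimws(tolower(x)), "\\s+")
}

tokens <- tokenize_on_whitespace(d)
print(tokens)
# [[1]]
# [1] "i"        "like"     "we23.exe"
#
# [[2]]
# [1] "i"            "like"         "icecream.exe"
```

The trade-off is that trailing sentence punctuation (for example "exe." at the end of a sentence) would also remain part of the token.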