Closed: rajkorde closed this issue 7 years ago
Thanks for the bug report, @rajkorde. However, these kinds of decisions about word boundaries aren't made directly by tokenizers. They are made by the stringi package, which in turn relies on the ICU library. If you think these boundaries are in error, I'd encourage you to file a bug report in one of those two places.
A development version of tokenizers should soon support tokenization by whitespace, which is less smart but would be more consistent in your case.
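As a stopgap, a whitespace tokenizer can be sketched in a few lines of base R. This is a minimal sketch only, assuming base R and nothing else; `tokenize_ws` is a hypothetical name, not the eventual tokenizers function:

```r
# Hypothetical whitespace tokenizer: split each string on runs of
# whitespace, then lowercase. No word-boundary analysis is involved,
# so filenames like "icecream.exe" and "we23.exe" stay intact.
tokenize_ws <- function(x) {
  lapply(strsplit(x, "\\s+"), tolower)
}

tokenize_ws(c("I like we23.exe", "I like icecream.exe"))
```

Because it never consults ICU's boundary rules, both `we23.exe` and `icecream.exe` come through as single tokens, at the cost of not handling punctuation attached to ordinary words.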
Tokenization is inconsistent when numbers are followed by a period. In the example below, we23.exe gets tokenized but not icecream.exe.
```r
d <- c("I like we23.exe", "I like icecream.exe")
tokenize_words(d)
#> [[1]]
#> [1] "i"    "like" "we23" "exe"
#>
#> [[2]]
#> [1] "i"    "like" "icecream.exe"
```
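One way to confirm that the split comes from ICU rather than from tokenizers itself is to call stringi's ICU-backed boundary splitter directly. This is a sketch under the assumption that tokenizers delegates to `stri_split_boundaries()` with word-type break iteration:

```r
library(stringi)

d <- c("I like we23.exe", "I like icecream.exe")

# ICU word-boundary analysis via stringi; skip_word_none drops
# tokens that contain no letters or digits (e.g. bare punctuation).
stri_split_boundaries(d, type = "word", skip_word_none = TRUE)
```

If this reproduces the same inconsistency (splitting `we23.exe` but not `icecream.exe`), the behavior follows ICU's word-segmentation rules, which treat a period between two letters differently from a period following digits.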