ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

implicit conversion of character input to UTF-8 #87

Open ablaette opened 7 months ago

ablaette commented 7 months ago

tokenize_words() implicitly converts non-UTF-8 input to UTF-8. See the following example (latin1 in, UTF-8 out). I was not aware of this behavior, and it caused me some headaches (see https://github.com/PolMine/cwbtools/issues/8#issue-415592225).

library(tokenizers)
library(magrittr) # the example below uses the %>% pipe

# latin1 in ...
c("Smørrebrød tastes great!") %>%
  iconv(from = "UTF-8", to = "latin1") %>%
  tokenize_words(lowercase = FALSE, strip_punct = FALSE) %>%
  .[[1]] %>%
  Encoding()

# ... UTF-8 out (ASCII-only tokens are marked "unknown"):

[1] "UTF-8" "unknown" "unknown" "unknown"

Obviously, the days of 'latin1' are almost entirely over. But the package documentation is silent on this behavior; the only reference to encoding is in the 'Description' field of the DESCRIPTION file: "The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'."

Maybe include a sentence like this in the 'basic-tokenizers' documentation object: "Non-UTF-8 input is converted to UTF-8."
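
For concreteness, a sketch of where such a note could live: a roxygen @section in the block that generates basic-tokenizers.Rd (the section title and wording are my suggestion, not existing package text):

#' @section Encoding:
#' Non-UTF-8 input is converted to UTF-8 before tokenization. The
#' returned tokens therefore do not preserve the encoding of the input.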