ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

implicit conversion of character input to UTF-8 #87

Open ablaette opened 7 months ago

ablaette commented 7 months ago

tokenize_words() implicitly converts non-UTF-8 input to UTF-8. See the following example (latin1 in, UTF-8 out). I was not aware of this behavior, and it caused me some headaches (see https://github.com/PolMine/cwbtools/issues/8#issue-415592225).

library(tokenizers)
library(magrittr) # the example below uses the %>% pipe

# latin1 in ...
c("Smørrebrød tastes great!") %>%
  iconv(from = "UTF-8", to = "latin1") %>%
  tokenize_words(lowercase = FALSE, strip_punct = FALSE) %>%
  .[[1]] %>%
  Encoding()

# ... UTF-8 out (ASCII-only tokens are marked "unknown"):

[1] "UTF-8" "unknown" "unknown" "unknown"

Obviously, the days of 'latin1' are almost entirely over. But the package documentation is silent on this behavior; the only reference to encoding is in the 'Description' field of the DESCRIPTION file: "The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'."

Maybe include a sentence like this in the 'basic-tokenizers' documentation object: "Non-UTF-8 input is converted to UTF-8."
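
For concreteness, a sketch of where such a note could live: a roxygen @section in the block that generates basic-tokenizers.Rd (the section title and wording are my suggestion, not existing package text):

#' @section Encoding:
#' Non-UTF-8 input is converted to UTF-8 before tokenization. The
#' returned tokens therefore do not preserve the encoding of the input.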