ropensci / spelling

Tools for Spell Checking in R

Error in read_xml.raw: Input is not proper UTF-8, indicate encoding ! #70

Open DanChaltiel opened 1 year ago



Running spelling::spell_check_test() fails on the crosstable package with the following error:

#>Error in read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html,  : 
#>  Input is not proper UTF-8, indicate encoding !
#>Bytes: 0x93 0x63 0x79 0x94 [9]

I have no clue where this error comes from, and the error message is unfortunately not very informative.

Would it be possible to terminate early from spelling instead of xml2 so that the path is in the error message?
Of course, if we can also have the line and the specific bad character, it would be even better!

Note that in this case, UTF-8 is the default encoding in the package's DESCRIPTION and in the RStudio settings. R CMD check completes without error, so I guess the encoding problem is not that severe, don't you think?



After more debugging, it seems to pertain to this line:

In my case, it pointed to my file, which indeed contained special characters. I have no idea how they ended up there though, and they are far too numerous for me to correct manually (a knitting problem from README.Rmd, I guess).
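
For what it's worth, the four bytes reported in the error (0x93 0x63 0x79 0x94) are not valid UTF-8 on their own, but they would decode cleanly as curly-quoted text under Windows-1252. The sketch below only illustrates that check in base R; that the file was actually saved in Windows-1252 is an assumption, not something the error confirms.

```r
# Bytes copied from the error message above.
bytes <- as.raw(c(0x93, 0x63, 0x79, 0x94))
x <- rawToChar(bytes)

# 0x93 cannot start a UTF-8 sequence, so the string is rejected.
is_utf8 <- validUTF8(x)

# Assumption: the text was written as Windows-1252 (CP1252), where
# 0x93 and 0x94 are the left and right curly double quotes.
decoded <- iconv(x, from = "CP1252", to = "UTF-8")

print(is_utf8)
print(decoded)
```

If the decoded text looks like what was intended, re-saving the file as UTF-8 (or converting it with iconv()) may be quicker than correcting each character by hand.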


Since this confusing problem is not that rare (#52, #58, #62), a fix would be useful.

Here are some proposals:

1) simply use a tryCatch() on xml2::xml_ns_strip() so that we can add path in the error message

2) add a warning in the specific case of non-UTF-8 characters:

  text <- readLines(path, warn = FALSE, encoding = "UTF-8")
  invalid <- !validUTF8(text)
  # Warn only when at least one line fails UTF-8 validation
  if (any(invalid)) {
    warning("The file ", path, " has non-UTF-8 characters on rows: ", paste(which(invalid), collapse = ", "))
  }

3) use this trick from xfun::read_utf8() to ignore the problem (spell_check_package() will have no error):

  opts = options(encoding = "native.enc")
  on.exit(options(opts), add = TRUE)
  text <- readLines(path, warn = FALSE, encoding = "UTF-8")

We can do all three at the same time. I can make a PR if needed.
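
Proposal 1 could look roughly like the sketch below. It is only an illustration: with_path_context() and fake_parse() are hypothetical names, not part of spelling or xml2, and the file path is made up. In the package itself, the wrapped expression would presumably be the xml2::xml_ns_strip() call.

```r
# Hypothetical helper (not part of spelling): evaluate `expr` and, if it
# fails, re-raise the error with the file path prepended to the message.
with_path_context <- function(path, expr) {
  tryCatch(
    expr,
    error = function(e) {
      stop("Failed to parse file '", path, "': ", conditionMessage(e),
           call. = FALSE)
    }
  )
}

# Stand-in for the real parsing step, which this sketch does not call.
fake_parse <- function(text) {
  if (!all(validUTF8(text))) {
    stop("Input is not proper UTF-8, indicate encoding !")
  }
  text
}

# Invalid UTF-8 input built from the bytes in the original error.
bad <- rawToChar(as.raw(c(0x93, 0x63, 0x79, 0x94)))

# Capture the rewritten message instead of aborting the script.
msg <- tryCatch(
  with_path_context("some/file.md", fake_parse(bad)),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Because `expr` is evaluated lazily inside tryCatch(), the helper can wrap any parsing call without changing it, and the original xml2 message is kept after the path.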