tidyverse / readr

Read flat files (csv, tsv, fwf) into R

Confusing error for non-ascii input #1521

Open mine-cetinkaya-rundel opened 10 months ago

mine-cetinkaya-rundel commented 10 months ago

This is from R4DS.


library(readr)

x1 <- "text\nEl Ni\xf1o was particularly bad this year"
read_csv(x1)
#> Warning in grepl("\n", path): unable to translate 'text
#> El Ni<f1>o was particularly bad this year' to a wide string
#> Warning in grepl("\n", path): input string 1 is invalid
#> Warning in grepl("^((http|ftp)s?|sftp)://", path): unable to translate 'text
#> El Ni<f1>o was particularly bad this year' to a wide string
#> Warning in grepl("^((http|ftp)s?|sftp)://", path): input string 1 is invalid
#> Warning in regexpr(regex, path, perl = TRUE): input string 1 is invalid UTF-8
#> Warning in grepl("^(/|[A-Za-z]:|\\\\|~)", path): unable to translate 'text
#> El Ni<f1>o was particularly bad this year' to a wide string
#> Warning in grepl("^(/|[A-Za-z]:|\\\\|~)", path): input string 1 is invalid
#> Error: 'text El Ni<f1>o was particularly bad this year' does not exist in
#> current working directory
#> ('/private/var/folders/3_/xjzgh7dj511d5t7996fq_1sm0000gn/T/RtmpqIT6YG/reprex-a72c13762e55-cilia-stag').

Created on 2023-11-09 with reprex v2.0.2

jennybc commented 10 months ago

In terms of what's happening, read_csv() is really calling vroom::vroom(delim = ",", locale = default_locale()), and the default locale has UTF-8 encoding. The escape sequence \xf1 inserts a single byte, 0xF1 (ñ in latin1), which is not valid UTF-8 in this string. That is the root cause of all the problems.
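To see the offending byte directly, here is a small check that uses only base R (`charToRaw()` and `validUTF8()`). It is a sketch for illustration and does not touch readr or vroom:

```r
x1 <- "text\nEl Ni\xf1o was particularly bad this year"

# \xf1 inserts one raw byte, 0xf1 (the latin1 code for "ñ").
bad_byte <- charToRaw("\xf1")
bad_byte
#> [1] f1

# In UTF-8, 0xf1 can only start a multi-byte sequence. Followed by "o",
# it is malformed, so the whole string fails validation.
validUTF8(x1)
#> [1] FALSE
```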

A complicating factor is that the input x1 is clearly intended as literal input, but it's being processed as a file path. The main error is reported last above:

#> Error: 'text El Ni<f1>o was particularly bad this year' does not exist in
#> current working directory

Then we're also getting lots of base R warnings from a failed file existence check, since the file path is not valid UTF-8.

I'm pretty surprised this code ever worked or that it worked in recent memory. I'll have a think on whether we can improve on the error. But the error and warnings above do actually explain what's wrong, albeit in a rather cryptic way.

If you want to update the code, here are some ideas. Key changes are to explicitly convert from latin1 to UTF-8 and to use I() to indicate literal input.

If you want to keep using the \x escape sequence, then you'll need to convert that string to UTF-8:


library(readr)
library(vroom)

# Convert the latin1 bytes to UTF-8, then wrap in I() to mark literal input.
x1 <- "text\nEl Ni\xf1o was particularly bad this year"
read_csv(I(iconv(x1, "latin1", "utf-8")), show_col_types = FALSE)$text
#> [1] "El Niño was particularly bad this year"

# The same fix when calling vroom directly.
vroom(I(iconv(x1, "latin1", "utf-8")), delim = ",", show_col_types = FALSE)$text
#> [1] "El Niño was particularly bad this year"
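If the conversion has to be applied in several places, it could be wrapped in a small guard. The sketch below is base R only. `to_utf8()` is a hypothetical helper name, not something readr or vroom provides, and the latin1 default is an assumption about where the bytes came from:

```r
# Hypothetical helper (not part of readr): return a single string as valid
# UTF-8, converting from `from` only when the input is not already valid.
to_utf8 <- function(x, from = "latin1") {
  stopifnot(is.character(x), length(x) == 1, !is.na(x))
  if (validUTF8(x)) {
    return(x)
  }
  out <- iconv(x, from = from, to = "UTF-8")
  if (is.na(out)) {
    stop("could not convert input from ", from, " to UTF-8")
  }
  out
}

x1 <- "text\nEl Ni\xf1o was particularly bad this year"
x1_utf8 <- to_utf8(x1)
validUTF8(x1_utf8)
#> [1] TRUE
```

The result can then be passed as literal input, as in `read_csv(I(x1_utf8))`. The guard cannot detect the true source encoding; it only applies the one it is told to assume.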

But if this is just about using an accented character in literal input, then use a \u escape sequence instead, to get a UTF-8 string.

x1 <- "text\nEl Ni\u00F1o was particularly bad this year"
read_csv(I(x1), show_col_types = FALSE)$text
#> [1] "El Niño was particularly bad this year"

Created on 2023-11-09 with reprex v2.0.2.9000
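To confirm that the two approaches end up with the same bytes, the base R comparison below may help. It is only an illustration; the variable names `from_u` and `from_x` are made up for this sketch:

```r
# \u escape: R stores the string as UTF-8 and marks it as such.
from_u <- "El Ni\u00F1o"

# \x escape: a raw latin1 byte that has to be converted explicitly.
from_x <- iconv("El Ni\xf1o", from = "latin1", to = "UTF-8")

Encoding(from_u)
#> [1] "UTF-8"

# "ñ" is two bytes in UTF-8.
charToRaw("\u00F1")
#> [1] c3 b1

# Both routes produce the same byte sequence.
identical(charToRaw(from_u), charToRaw(from_x))
#> [1] TRUE
```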

jennybc commented 10 months ago

The noise/errors around the original example (which originates in R4DS) have probably gotten worse over time due to changes in base R. Some relevant items from NEWS:

  • R 4.3.0: Regular expression functions now check more thoroughly whether their inputs are valid strings (in their encoding, e.g. in UTF-8).
  • R 4.0.0: Most functions with file-path inputs will give an explicit error if a file-path input in a marked encoding cannot be translated (to the native encoding or in some cases on Windows to UTF-8), rather than translate to a different file path using escapes. Some (such as dir.exists(), file.exists(), file.access(), file.info(), list.files(), normalizePath() and path.expand()) treat this like any other non-existent file, often with a warning.
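The regular-expression change can be observed without readr. The sketch below collects whatever warnings `grepl()` raises on the invalid string. The exact messages depend on the R version and locale, so no output is shown for them:

```r
x1 <- "text\nEl Ni\xf1o was particularly bad this year"

# Collect warnings instead of printing them; their wording varies by
# R version and locale, so nothing is assumed about the content.
msgs <- character()
res <- withCallingHandlers(
  grepl("\n", x1),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
msgs

# Byte-wise matching compares raw bytes, so it does not need a valid string.
grepl("\n", x1, useBytes = TRUE)
```

On an R version with the stricter checks and a UTF-8 locale, `msgs` would be expected to resemble the `grepl()` warnings in the original report.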