Open mine-cetinkaya-rundel opened 1 year ago
In terms of what's happening, read_csv() is really calling vroom::vroom(delim = ",", locale = default_locale()) and the default locale has UTF-8 encoding. The escape sequence \xf1 is not valid UTF-8, which is the root cause of all the problems.
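To see the root cause in isolation, here is a minimal base R sketch with no readr involved. It assumes R >= 3.3.0 for validUTF8() and reuses the x1 string from the example:

```r
# The same string as in the example; \xf1 inserts the single raw byte 0xf1
x1 <- "text\nEl Ni\xf1o was particularly bad this year"

# 0xf1 would start a multi-byte UTF-8 sequence, but no continuation bytes
# follow it, so the string as a whole is not valid UTF-8
validUTF8(x1)
#> [1] FALSE
```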
A complicating factor is that the input x1 is clearly intended as literal input, but it's being processed as a file path. The main error is reported last above:
#> Error: 'text El Ni<f1>o was particularly bad this year' does not exist in
#> current working directory
Then we're also getting lots of base R warnings from a failed file existence check, since the file path is not valid UTF-8.
I'm pretty surprised this code ever worked, or at least that it worked in recent memory. I'll have a think on whether we can improve on the error. But the error and warnings above do actually explain what's wrong, albeit in a rather cryptic way.
If you want to update the code, here are some ideas. Key changes are to explicitly convert from latin1 to UTF-8 and to use I() to indicate literal input.
If you want to keep using the \x escape sequence, then you'll need to convert that string to UTF-8:
library(readr)
x1 <- "text\nEl Ni\xf1o was particularly bad this year"
read_csv(I(iconv(x1, "latin1", "utf-8")), show_col_types = FALSE)$text
#> [1] "El Niño was particularly bad this year"
library(vroom)
x1 <- "text\nEl Ni\xf1o was particularly bad this year"
vroom(I(iconv(x1, "latin1", "utf-8")), delim = ",", show_col_types = FALSE)$text
#> [1] "El Niño was particularly bad this year"
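For anyone curious what the iconv() step changes, here is a small base R sketch at the byte level. The names x_latin1 and x_utf8 are just illustrative, not from the original example:

```r
x_latin1 <- "Ni\xf1o"                          # ñ stored as the single latin1 byte f1
x_utf8 <- iconv(x_latin1, "latin1", "UTF-8")   # re-encode that byte as UTF-8

charToRaw(x_latin1)
#> [1] 4e 69 f1 6f
charToRaw(x_utf8)
#> [1] 4e 69 c3 b1 6f

# Only the converted string passes the UTF-8 validity check
validUTF8(c(x_latin1, x_utf8))
#> [1] FALSE  TRUE
```

The single byte f1 becomes the two-byte sequence c3 b1, which is how UTF-8 encodes ñ.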
But if this is just about using an accented character in literal input, then use a \u escape sequence instead, to get a UTF-8 string.
library(readr)
x1 <- "text\nEl Ni\u00F1o was particularly bad this year"
read_csv(I(x1), show_col_types = FALSE)$text
#> [1] "El Niño was particularly bad this year"
Created on 2023-11-09 with reprex v2.0.2.9000
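As a quick check of why the \u version needs no conversion, here is a base R sketch. It only inspects the string and the I() wrapper; that readr uses the AsIs class to recognise literal input is my reading of the advice above, not something this snippet verifies:

```r
x1 <- "text\nEl Ni\u00F1o was particularly bad this year"

# A \u escape makes R store the string as UTF-8 and mark it as such
Encoding(x1)
#> [1] "UTF-8"
validUTF8(x1)
#> [1] TRUE

# I() does not alter the text, it only adds the "AsIs" class
class(I(x1))
#> [1] "AsIs"
```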
The noise/errors around the original example (which originates in R4DS) have probably gotten worse over time due to changes in base R. Some relevant items from NEWS:
- R 4.3.0: Regular expression functions now check more thoroughly whether their inputs are valid strings (in their encoding, e.g. in UTF-8).
- R 4.0.0: Most functions with file-path inputs will give an explicit error if a file-path input in a marked encoding cannot be translated (to the native encoding or in some cases on Windows to UTF-8), rather than translate to a different file path using escapes. Some (such as dir.exists(), file.exists(), file.access(), file.info(), list.files(), normalizePath() and path.expand()) treat this like any other non-existent file, often with a warning.
This is from R4DS.
Created on 2023-11-09 with reprex v2.0.2