Open jennybc opened 2 years ago
Here's the same drill, which goes sideways on Ubuntu 18.04 in an en_US locale. But the problem seems to be different.
R.version.string
#> [1] "R version 4.2.0 (2022-04-22)"
.Platform$OS.type
#> [1] "unix"
Sys.getlocale()
#> [1] "LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C"
l10n_info()
#> $MBCS
#> [1] FALSE
#>
#> $`UTF-8`
#> [1] FALSE
#>
#> $`Latin-1`
#> [1] TRUE
#>
#> $codeset
#> [1] "ISO-8859-1"
I'm doing this by ssh'ing into a GHA worker, which uses tmux, so non-ASCII characters don't always print correctly (the ä is missing from some of the output below). But I also look at the marked encoding and bytes, which are not affected.
make_temp_path <- function(ext) {
file.path(tempdir(), paste0("d\u00E4t", ext))
}
(bz2file <- withr::local_file(make_temp_path(".tar.bz2")))
#> [1] "/tmp/Rtmp9rLgLO/d.tar.bz2"
(gzfile <- withr::local_file(make_temp_path(".tar.gz")))
#> [1] "/tmp/Rtmp9rLgLO/d.tar.gz"
(xzfile <- withr::local_file(make_temp_path(".tar.xz")))
#> [1] "/tmp/Rtmp9rLgLO/d.tar.xz"
(zipfile <- withr::local_file(make_temp_path(".zip")))
#> [1] "/tmp/Rtmp9rLgLO/d.zip"
(briofile <- withr::local_file(make_temp_path(".csv")))
#> [1] "/tmp/Rtmp9rLgLO/d.csv"
write_archive_file <- function(file) {
out_con <- archive::archive_write(file, "d\u00E4t.csv")
write.csv(file = out_con, data.frame(a = "A", b = "B"))
}
list.files(tempdir(), pattern = "^d")
#> character(0)
write_archive_file(gzfile)
write_archive_file(bz2file)
write_archive_file(xzfile)
write_archive_file(zipfile)
brio::write_lines("whatever", briofile)
(x <- list.files(tempdir(), pattern = "^d"))
#> [1] "dät.tar.bz2" "dät.tar.gz" "dät.tar.xz" "dät.zip" "d.csv"
Encoding(x)
#> [1] "unknown" "unknown" "unknown" "unknown" "unknown"
bz2file
#> [1] "/tmp/Rtmp9rLgLO/d.tar.bz2"
charToRaw(bz2file)
#> [1] 2f 74 6d 70 2f 52 74 6d 70 39 72 4c 67 4c 4f 2f 64 c3 a4 74 2e 74 61 72 2e
#> [26] 62 7a 32
charToRaw(x[1]) # bz2file
#> [1] 64 c3 a4 74 2e 74 61 72 2e 62 7a 32
charToRaw(x[5]) # briofile
#> [1] 64 e4 74 2e 63 73 76
The path for the file written by brio seems to use the native encoding, e4 for ä, instead of c3 a4.
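For reference, here is a minimal sketch of where those two byte patterns come from. It is not part of the session above. It only uses base R to put the same string into both encodings, and the variable names are illustrative.

```r
# Illustrative only: the same string in two encodings.
utf8 <- enc2utf8("d\u00E4t")                          # force UTF-8 bytes
latin1 <- iconv(utf8, from = "UTF-8", to = "latin1")  # re-encode as ISO-8859-1

utf8_bytes <- charToRaw(utf8)      # 64 c3 a4 74: ä is the two bytes c3 a4
latin1_bytes <- charToRaw(latin1)  # 64 e4 74: ä is the single byte e4
```

So a file name containing e4 is the native (Latin-1) form of the name, and one containing c3 a4 is the UTF-8 form.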
As before, explicitly calling enc2native() doesn't change anything. Whatever is happening seems to be inside archive.
write_archive_file(enc2native(gzfile))
write_archive_file(enc2native(bz2file))
write_archive_file(enc2native(xzfile))
write_archive_file(enc2native(zipfile))
brio::write_lines("whatever", enc2native(briofile))
(x <- list.files(tempdir(), pattern = "^d"))
#> [1] "dät.tar.bz2" "dät.tar.gz" "dät.tar.xz" "dät.zip" "d.csv"
lapply(x, charToRaw)
#> [[1]]
#> [1] 64 c3 a4 74 2e 74 61 72 2e 62 7a 32
#>
#> [[2]]
#> [1] 64 c3 a4 74 2e 74 61 72 2e 67 7a
#>
#> [[3]]
#> [1] 64 c3 a4 74 2e 74 61 72 2e 78 7a
#>
#> [[4]]
#> [1] 64 c3 a4 74 2e 7a 69 70
#>
#> [[5]]
#> [1] 64 e4 74 2e 63 73 76
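Reading the raw bytes by eye gets tedious, so the check could be wrapped in a small function. This is only a sketch: which_encoding() is a hypothetical helper, not something from archive or brio. It assumes the native encoding of interest is Latin-1, as in this locale.

```r
# Hypothetical helper: does an on-disk file name match the UTF-8 or the
# Latin-1 encoding of the name we intended to write?
which_encoding <- function(on_disk, intended) {
  bytes <- charToRaw(on_disk)
  utf8 <- enc2utf8(intended)
  latin1 <- iconv(utf8, from = "UTF-8", to = "latin1")
  if (identical(bytes, charToRaw(utf8))) {
    return("UTF-8")
  }
  # iconv() returns NA when the name cannot be represented in Latin-1
  if (!is.na(latin1) && identical(bytes, charToRaw(latin1))) {
    return("latin1")
  }
  "other"
}
```

Given the bytes listed above, which_encoding(x[5], "d\u00E4t.csv") should report "latin1" and which_encoding(x[1], "d\u00E4t.tar.bz2") should report "UTF-8". For a pure-ASCII name the two encodings coincide and the helper reports "UTF-8".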
archive can't read from the file paths it wrote to.
archive::archive_read(bz2file)
#> Error in file(archive, "rb") : cannot open the connection
#> In addition: Warning message:
#> In file(archive, "rb") :
#> cannot open file '/tmp/Rtmp9rLgLO/d.tar.bz2': No such file or directory
As before, the files are OK, once you get your hands on a path that pleases the OS.
find_file <- function(ext) {
out <-
list.files(tempdir(), pattern = paste0("^d.*", ext, "$"), full.names = TRUE)
cat("Reading from:\n", out, "\n")
out
}
read.csv(archive::archive_read(find_file(".tar.bz2")), row.names = 1)
#> Reading from:
#> /tmp/Rtmp9rLgLO/dät.tar.bz2
#> a b
#> 1 A B
archive::archive(find_file(".tar.bz2"))
#> Reading from:
#> /tmp/Rtmp9rLgLO/dät.tar.bz2
#> # A tibble: 1 x 3
#> path size date
#> <chr> <int> <dttm>
#> 1 "d\u00e4t.csv" 23 2022-05-14 20:17:09
brio can read from the same filepath that was specified for writing:
brio::read_lines(briofile)
#> [1] "whatever"
I discovered this while tightening up vroom's filepath handling. I made the reprex on Windows with R 4.1 and, anecdotally, have the same "no round trip" problem on Ubuntu 18.04 with an en_US locale (which is ISO-8859-1), which is included in vroom's test matrix. I think the cause / fix is likely different on the two platforms, though.
Based on recent experience in readxl and vroom, I'm going to hypothesize that cpp11 is now auto-converting a filepath to UTF-8 and then it's being re-encoded as UTF-8 (in error). I've also now realized that some of what I see interactively is not reflected in the reprex, maybe because of knitr's enforcement of UTF-8? :sob: I've tried to compensate for this.
I’m going to try writing bz2, gz, xz, and zip because I see these specific cases in vroom. First I pass UTF-8 encoded paths. I include brio for comparison.
archive hasn’t written to the intended filepaths, brio has. What if I explicitly pass paths in the native encoding?
Passing natively encoded paths doesn’t help, i.e. we just overwrite the previous file paths.
What’s the nature of the problem? It looks like the UTF-8 bytes are being treated as Windows-1252 bytes and then getting re-encoded as UTF-8.
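To make that hypothesis concrete, here is a sketch of what such a double encoding would produce for ä. It uses base R only and is not output from the report. It simply mislabels UTF-8 bytes as Windows-1252 ("CP1252" in iconv) and converts them to UTF-8 a second time.

```r
# Illustrative only: simulate "UTF-8 bytes read as Windows-1252, then
# re-encoded as UTF-8".
utf8 <- enc2utf8("d\u00E4t")
doubled <- iconv(utf8, from = "CP1252", to = "UTF-8")

utf8_bytes <- charToRaw(utf8)        # 64 c3 a4 74
doubled_bytes <- charToRaw(doubled)  # 64 c3 83 c2 a4 74
```

Each of the two original bytes (c3, a4) is treated as its own character (Ã, ¤) and expanded to two bytes, so the name comes out as "dÃ¤t" rather than "dät".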
Can I read from the filepaths I tried to write to? No.
For the record, the files written by archive are just fine. And, for all but .zip, even the name of the included file, which itself has the same non-ASCII character in it, is OK. For .zip, that file name is mis-encoded and, incidentally, the date looks wrong (1980?).
Created on 2022-05-12 by the reprex package (v2.0.1)