r-lib / archive

R bindings to libarchive, supporting a large variety of archive formats
https://archive.r-lib.org/
Other
141 stars 15 forks source link

Can't roundtrip a filepath with non-ascii characters in a non-UTF-8 locale #75

Open jennybc opened 2 years ago

jennybc commented 2 years ago

I discovered this while tightening up vroom's filepath handling. I made the reprex on Windows with R 4.1 and, anecdotally, have the same "no round trip" problem on ubuntu 18.04 with en_us locale (which is ISO-8819-1), which is included in vroom's test matrix. I think the cause / fix is likely different on the two platforms, though.

Based on recent experience in readxl and vroom, I'm going to hypothesize that cpp11 is now auto-converting a filepath to UTF-8 and then it's being re-encoded as UTF-8 (in error). I've also now realized that some of what I see interactively is not reflected in the reprex, maybe because of knitr's enforcement of UTF-8? :sob: I've tried to compensate for this.

R.version.string
#> [1] "R version 4.1.2 (2021-11-01)"
.Platform$OS.type
#> [1] "windows"
Sys.getlocale()
#> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
l10n_info()
#> $MBCS
#> [1] FALSE
#> 
#> $`UTF-8`
#> [1] FALSE
#> 
#> $`Latin-1`
#> [1] TRUE
#> 
#> $codepage
#> [1] 1252
#> 
#> $system.codepage
#> [1] 1252

I’m going to try writing bz2, gz, xz, and zip because I see these specific cases in vroom. First I pass UTF-8 encoded paths. I include brio for comparison.

make_temp_path <- function(ext) {
  file.path(tempdir(), paste0("d\u00E4t", ext))
}

(bz2file  <- withr::local_file(make_temp_path(".tar.bz2")))
#> [1] "C:\\Users\\jenny\\AppData\\Local\\Temp\\RtmpUbfHcR/dät.tar.bz2"
(gzfile   <- withr::local_file(make_temp_path(".tar.gz")))
#> [1] "C:\\Users\\jenny\\AppData\\Local\\Temp\\RtmpUbfHcR/dät.tar.gz"
(xzfile   <- withr::local_file(make_temp_path(".tar.xz")))
#> [1] "C:\\Users\\jenny\\AppData\\Local\\Temp\\RtmpUbfHcR/dät.tar.xz"
(zipfile  <- withr::local_file(make_temp_path(".zip")))
#> [1] "C:\\Users\\jenny\\AppData\\Local\\Temp\\RtmpUbfHcR/dät.zip"
(briofile <- withr::local_file(make_temp_path(".csv")))
#> [1] "C:\\Users\\jenny\\AppData\\Local\\Temp\\RtmpUbfHcR/dät.csv"

write_archive_file <- function(file) {
  out_con <- archive::archive_write(file, "d\u00E4t.csv")
  write.csv(file = out_con, data.frame(a = "A", b = "B"))
}

# at first, 0 files
list.files(tempdir(), pattern = "^d")
#> character(0)

write_archive_file(gzfile)
write_archive_file(bz2file)
write_archive_file(xzfile)
write_archive_file(zipfile)
brio::write_lines("whatever", briofile)

# now there are 5 files, 4 mis-encoded ones from archive, 1 from brio
(x <- list.files(tempdir(), pattern = "^d"))
#> [1] "dät.tar.bz2" "dät.tar.gz"  "dät.tar.xz"  "dät.zip"     "dät.csv"
Encoding(x)
#> [1] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8"
bz2file
#> [1] "C:\\Users\\jenny\\AppData\\Local\\Temp\\RtmpUbfHcR/dät.tar.bz2"
x[1]
#> [1] "dät.tar.bz2"
charToRaw(bz2file)
#>   [1] 43 3a 5c 55 73 65 72 73 5c 6a 65 6e 6e 79 5c 41 70 70 44 61 74 61 5c 4c 6f 63 61 6c
#>  [29] 5c 54 65 6d 70 5c 52 74 6d 70 41 76 59 6e 4f 33 2f 64 c3 a4 74 2e 74 61 72 2e 62 7a
#>  [57] 32
charToRaw(x[1]) # bz2file
#>  [1] 64 c3 83 c2 a4 74 2e 74 61 72 2e 62 7a 32

archive hasn’t written to the intended filepaths, brio has. What if I explicitly pass paths in the native encoding?

write_archive_file(enc2native(gzfile))
write_archive_file(enc2native(bz2file))
write_archive_file(enc2native(xzfile))
write_archive_file(enc2native(zipfile))
brio::write_lines("whatever", enc2native(briofile))

(x <- list.files(tempdir(), pattern = "^d"))
#> [1] "dät.tar.bz2" "dät.tar.gz"  "dät.tar.xz"  "dät.zip"     "dät.csv"

Passing natively encoded paths doesn’t help, i.e. we just overwrite the previous file paths.

What’s the nature of the problem? It looks like the UTF-8 bytes are being treated as Windows-1252 bytes and then getting re-encoded as UTF-8.

lapply(x, charToRaw)
#> [[1]]
#>  [1] 64 c3 83 c2 a4 74 2e 74 61 72 2e 62 7a 32
#>
#> [[2]]
#>  [1] 64 c3 83 c2 a4 74 2e 74 61 72 2e 67 7a
#>
#> [[3]]
#>  [1] 64 c3 83 c2 a4 74 2e 74 61 72 2e 78 7a
#>
#> [[4]]
#>  [1] 64 c3 83 c2 a4 74 2e 7a 69 70
#>
#> [[5]]
#>  [1] 64 c3 a4 74 2e 63 73 76

a_umlaut <- "\u00E4"
charToRaw(a_umlaut)
#> [1] c3 a4
iconv(a_umlaut, from = "Windows-1252", to = "UTF-8")
#> [1] "ä"
charToRaw(iconv(a_umlaut, from = "Windows-1252", to = "UTF-8"))
#> [1] c3 83 c2 a4

Can I read from the filepaths I tried to write to? No.

archive::archive_read(bz2file)
#> Warning in file(archive, "rb"): cannot open file 'C:
#> \Users\jenny\AppData\Local\Temp\RtmpUbfHcR/dät.tar.bz2': No such file or
#> directory
#> Error in file(archive, "rb"): cannot open the connection
archive::archive_read(gzfile)
#> Warning in file(archive, "rb"): cannot open file 'C:
#> \Users\jenny\AppData\Local\Temp\RtmpUbfHcR/dät.tar.gz': No such file or
#> directory
#> Error in file(archive, "rb"): cannot open the connection
archive::archive_read(xzfile)
#> Warning in file(archive, "rb"): cannot open file 'C:
#> \Users\jenny\AppData\Local\Temp\RtmpUbfHcR/dät.tar.xz': No such file or
#> directory
#> Error in file(archive, "rb"): cannot open the connection
archive::archive_read(zipfile)
#> Warning in file(archive, "rb"): cannot open file 'C:
#> \Users\jenny\AppData\Local\Temp\RtmpUbfHcR/dät.zip': No such file or directory
#> Error in file(archive, "rb"): cannot open the connection

For the record, the files written by archive are just fine. And, for all but .zip, even the name of the included file, which itself has the same non-ascii character in it, is OK. For .zip, that file name is mis-encoded and, incidentally, the date looks wrong (1980?).

# the files written are fine, just the path is broken
find_file <- function(ext) {
  out <- 
    list.files(tempdir(), pattern = paste0("^d.*", ext, "$"), full.names = TRUE)
  cat("Reading from:\n", out, "\n")
  out
}

read.csv(archive::archive_read(find_file(".tar.gz")), row.names = 1)
#> Reading from:
#>  C:\Users\jenny\AppData\Local\Temp\RtmpUbfHcR/dät.tar.gz
#>   a b
#> 1 A B
read.csv(archive::archive_read(find_file(".tar.bz2")), row.names = 1)
#> Reading from:
#>  C:\Users\jenny\AppData\Local\Temp\RtmpUbfHcR/dät.tar.bz2
#>   a b
#> 1 A B
read.csv(archive::archive_read(find_file(".tar.xz")), row.names = 1)
#> Reading from:
#>  C:\Users\jenny\AppData\Local\Temp\RtmpUbfHcR/dät.tar.xz
#>   a b
#> 1 A B
read.csv(archive::archive_read(find_file(".zip")), row.names = 1)
#> Reading from:
#>  C:\Users\jenny\AppData\Local\Temp\RtmpUbfHcR/dät.zip
#>   a b
#> 1 A B

archive::archive(find_file(".tar.gz"))
#> Reading from:
#>  C:\Users\jenny\AppData\Local\Temp\RtmpUbfHcR/dät.tar.gz
#> # A tibble: 1 x 3
#>   path     size date               
#>   <chr>   <int> <dttm>             
#> 1 dät.csv    23 2022-05-12 19:10:28
archive::archive(find_file(".tar.bz2"))
#> Reading from:
#>  C:\Users\jenny\AppData\Local\Temp\RtmpUbfHcR/dät.tar.bz2
#> # A tibble: 1 x 3
#>   path     size date               
#>   <chr>   <int> <dttm>             
#> 1 dät.csv    23 2022-05-12 19:10:28
archive::archive(find_file(".tar.xz"))
#> Reading from:
#>  C:\Users\jenny\AppData\Local\Temp\RtmpUbfHcR/dät.tar.xz
#> # A tibble: 1 x 3
#>   path     size date               
#>   <chr>   <int> <dttm>             
#> 1 dät.csv    23 2022-05-12 19:10:28

# uh-oh
archive::archive(find_file(".zip"))
#> Reading from:
#>  C:\Users\jenny\AppData\Local\Temp\RtmpUbfHcR/dät.zip
#> # A tibble: 1 x 3
#>   path             size date               
#>   <chr>           <int> <dttm>             
#> 1 "dA\u000ft.csv"    23 1980-01-01 00:00:00

brio::read_lines(briofile)
#> [1] "whatever"

Created on 2022-05-12 by the reprex package (v2.0.1)

jennybc commented 2 years ago

Here's the same drill, which goes sideways on ubuntu 18.04 with in a en_us locale. But the problem seems to be different.

R.version.string
#> [1] "R version 4.2.0 (2022-04-22)"
.Platform$OS.type
#> [1] "unix"
Sys.getlocale()
#> [1] "LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C"
l10n_info()
#> $MBCS
#> [1] FALSE
#> 
#> $`UTF-8`
#> [1] FALSE
#> 
#> $`Latin-1`
#> [1] TRUE
#> 
#> $codeset
#> [1] "ISO-8859-1"

I'm doing this by ssh'ing into a GHA worker, which uses tmux, so there are printing problems for non-ascii characters. But I also look at the marked encoding and bytes, which are not affected.

make_temp_path <- function(ext) {
  file.path(tempdir(), paste0("d\u00E4t", ext))
}

(bz2file  <- withr::local_file(make_temp_path(".tar.bz2")))
#> [1] "/tmp/Rtmp9rLgLO/d.tar.bz2"
(gzfile   <- withr::local_file(make_temp_path(".tar.gz")))
#> [1] "/tmp/Rtmp9rLgLO/d.tar.gz"
(xzfile   <- withr::local_file(make_temp_path(".tar.xz")))
#> [1] "/tmp/Rtmp9rLgLO/d.tar.xz"
(zipfile  <- withr::local_file(make_temp_path(".zip")))
#> [1] "/tmp/Rtmp9rLgLO/d.zip"
(briofile <- withr::local_file(make_temp_path(".csv")))
#> [1] "/tmp/Rtmp9rLgLO/d.csv"

write_archive_file <- function(file) {
  out_con <- archive::archive_write(file, "d\u00E4t.csv")
  write.csv(file = out_con, data.frame(a = "A", b = "B"))
}

list.files(tempdir(), pattern = "^d")
#> character(0)

write_archive_file(gzfile)
write_archive_file(bz2file)
write_archive_file(xzfile)
write_archive_file(zipfile)
brio::write_lines("whatever", briofile)

(x <- list.files(tempdir(), pattern = "^d"))
#> [1] "dät.tar.bz2" "dät.tar.gz"  "dät.tar.xz"  "dät.zip"     "d.csv"     
Encoding(x)
#> [1] "unknown" "unknown" "unknown" "unknown" "unknown"

bz2file
#> [1] "/tmp/Rtmp9rLgLO/d.tar.bz2"
charToRaw(bz2file)
#>  [1] 2f 74 6d 70 2f 52 74 6d 70 39 72 4c 67 4c 4f 2f 64 c3 a4 74 2e 74 61 72 2e
#> [26] 62 7a 32
charToRaw(x[1]) # bz2file
#>  [1] 64 c3 a4 74 2e 74 61 72 2e 62 7a 32
charToRaw(x[5]) # bz2file
#> [1] 64 e4 74 2e 63 73 76

The path for the file written by brio seems to use the native encoding, e4 for ä, instead of c3 a4.

As before, explicitly calling enc2native() doesn't change anything. Whatever is happening seems to be inside archive.

write_archive_file(enc2native(gzfile))
write_archive_file(enc2native(bz2file))
write_archive_file(enc2native(xzfile))
write_archive_file(enc2native(zipfile))
brio::write_lines("whatever", enc2native(briofile))x <- list.files(tempdir(), pattern = "^d"))
#> [1] "dät.tar.bz2" "dät.tar.gz"  "dät.tar.xz"  "dät.zip"     "d.csv"     
lapply(x, charToRaw)
#> [[1]]
#>  [1] 64 c3 a4 74 2e 74 61 72 2e 62 7a 32
#> 
#> [[2]]
#>  [1] 64 c3 a4 74 2e 74 61 72 2e 67 7a
#> 
#> [[3]]
#>  [1] 64 c3 a4 74 2e 74 61 72 2e 78 7a
#> 
#> [[4]]
#> [1] 64 c3 a4 74 2e 7a 69 70
#> 
#> [[5]]
#> [1] 64 e4 74 2e 63 73 76

archive can't read from the file paths it wrote to.

archive::archive_read(bz2file)
#> Error in file(archive, "rb") : cannot open the connection
#> In addition: Warning message:
#> In file(archive, "rb") :
#>   cannot open file '/tmp/Rtmp9rLgLO/d.tar.bz2': No such file or directory

As before, the files are OK, once you get your hand on a path that pleases the OS.

find_file <- function(ext) {
  out <- 
    list.files(tempdir(), pattern = paste0("^d.*", ext, "$"), full.names = TRUE)
  cat("Reading from:\n", out, "\n")
  out
}
#> read.csv(archive::archive_read(find_file(".tar.bz2")), row.names = 1)
#> Reading from:
#>  /tmp/Rtmp9rLgLO/dät.tar.bz2 
#>   a b
#> 1 A B
archive::archive(find_file(".tar.bz2"))
#> Reading from:
#>  /tmp/Rtmp9rLgLO/dät.tar.bz2 
#> # A tibble: 1 x 3
#>   path            size date               
#>   <chr>          <int> <dttm>             
#> 1 "d\u00e4t.csv"    23 2022-05-14 20:17:09

brio can read from the same filepath that specified for writing:

brio::read_lines(briofile)
#> [1] "whatever"