r-lib / archive

R bindings to libarchive, supporting a large variety of archive formats
https://archive.r-lib.org/

Possible to read from a multi-file archive? #77

Open pschloss opened 2 years ago


I'm trying to read the contents of a large archive (~3 GB) containing many small files (~100k) without first decompressing it. I don't have the ability to restructure the archive. Here's a reprex of what the archives look like: the directory structure is always the same and the files all have the same columns...

library(readr)
library(archive)

dir.create("data", showWarnings=FALSE)
write_csv(iris, "data/iris_a.csv")
write_csv(iris, "data/iris_b.csv")
write_csv(iris, "data/iris_c.csv")

archive_write_files("data.tar.gz",
                    c("data/",
                      "data/iris_a.csv",
                      "data/iris_b.csv",
                      "data/iris_c.csv"))

archive("data.tar.gz")

I'd like to do something like...

read_csv(c("data/iris_a.csv", "data/iris_b.csv", "data/iris_c.csv"), id = "file")

... but without first unpacking data.tar.gz.

If I do...

> read_csv(archive_read("data.tar.gz"), id = "file")
Rows: 0 Columns: 1

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 0 × 1
# … with 1 variable: file <chr>
# ℹ Use `colnames()` to see all variable names

That's because it's reading the first entry in the archive, which is the directory itself. I see that I can skip the directory entry and instead do...

> read_csv(archive_read("data.tar.gz", 2), id = "file")
Rows: 150 Columns: 6
── Column specification ──────────────────────────────────────────────────
Delimiter: ","
chr (1): Species
dbl (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 150 × 6
   file                         Sepal.Length Sepal.Width Petal…¹ Petal…² Species
   <chr>                               <dbl>       <dbl>   <dbl>   <dbl> <chr>  
 1 archive_read(data.tar.gz)[2]          5.1         3.5     1.4     0.2 setosa 
 2 archive_read(data.tar.gz)[2]          4.9         3       1.4     0.2 setosa 
 3 archive_read(data.tar.gz)[2]          4.7         3.2     1.3     0.2 setosa 
 4 archive_read(data.tar.gz)[2]          4.6         3.1     1.5     0.2 setosa 
 5 archive_read(data.tar.gz)[2]          5           3.6     1.4     0.2 setosa 
 6 archive_read(data.tar.gz)[2]          5.4         3.9     1.7     0.4 setosa 
 7 archive_read(data.tar.gz)[2]          4.6         3.4     1.4     0.3 setosa 
 8 archive_read(data.tar.gz)[2]          5           3.4     1.5     0.2 setosa 
 9 archive_read(data.tar.gz)[2]          4.4         2.9     1.4     0.2 setosa 
10 archive_read(data.tar.gz)[2]          4.9         3.1     1.5     0.1 setosa 
# … with 140 more rows, and abbreviated variable names ¹​Petal.Length,
#   ²​Petal.Width
# ℹ Use `print(n = ...)` to see more rows
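Instead of hard-coding the position, `archive_read()`'s second argument should also accept the entry's path, which makes the skip explicit. A minimal sketch (recreating a one-file version of the reprex archive so it runs standalone; the exact row count assumes the `iris` CSV round-trips cleanly):

```r
library(readr)
library(archive)

# Rebuild a small version of the reprex archive
dir.create("data", showWarnings = FALSE)
write_csv(iris, "data/iris_a.csv")
archive_write_files("data.tar.gz", c("data/", "data/iris_a.csv"))

# Read one member by name instead of by numeric index,
# so the directory entry never has to be counted past
one <- read_csv(archive_read("data.tar.gz", "data/iris_a.csv"),
                show_col_types = FALSE)
nrow(one)  # 150
```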

Building off of this, I could extract the contents of the archive and then step through each of the files with map_dfr...

library(purrr)

files <- archive("data.tar.gz")[["path"]][-1]  # drop the leading directory entry
names(files) <- files
map_dfr(files, ~ read_csv(archive_read("data.tar.gz", .x)), .id = "file")

Is there an easier way to read everything from the archive without the map_dfr step, and without the overhead of opening the archive repeatedly via both archive() and archive_read()?
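One variation on the map_dfr approach, selecting entries by pattern rather than dropping the first entry by position (more robust if the archive ever contains nested directories or stray files). This is a hedged sketch, not a single-pass read: it still re-opens the archive once per member, which is exactly the overhead the question is about. It recreates the reprex archive so it runs standalone:

```r
library(readr)
library(purrr)
library(archive)

# Rebuild the reprex archive
dir.create("data", showWarnings = FALSE)
write_csv(iris, "data/iris_a.csv")
write_csv(iris, "data/iris_b.csv")
write_csv(iris, "data/iris_c.csv")
archive_write_files("data.tar.gz",
                    c("data/",
                      "data/iris_a.csv",
                      "data/iris_b.csv",
                      "data/iris_c.csv"))

# Keep only the .csv members; the directory entry is excluded by
# pattern, not by position
paths <- archive("data.tar.gz")[["path"]]
csvs  <- paths[grepl("\\.csv$", paths)]

combined <- map_dfr(
  set_names(csvs),
  ~ read_csv(archive_read("data.tar.gz", .x), show_col_types = FALSE),
  .id = "file"
)
nrow(combined)  # 450 = 3 files x 150 rows
```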