I'm trying to read the contents of an archive (~3GB) with many little files in it (~100k) without decompressing the archive. I don't have the ability to reconfigure what the archive looks like. Here's a reprex of what the archives look like. The directory structure is the same and the files all have the same columns...
I'd like to do something like...
read_csv(c("data/iris_a.csv", "data/iris_b.csv", "data/iris_c.csv"), id = "file")
... but without first unpacking data.tar.gz.
If I do...
> read_csv(archive_read("data.tar.gz"), id = "file")
Rows: 0 Columns: 1
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 0 × 1
# … with 1 variable: file <chr>
# ℹ Use `colnames()` to see all variable names
This happens because it's reading the first entry, which is the directory itself. I see that I can skip the first entry in the archive and instead do...
> read_csv(archive_read("data.tar.gz", 2), id = "file")
Rows: 150 Columns: 6
── Column specification ──────────────────────────────────────────────────
Delimiter: ","
chr (1): Species
dbl (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 150 × 6
file Sepal.Length Sepal.Width Petal…¹ Petal…² Species
<chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 archive_read(data.tar.gz)[2] 5.1 3.5 1.4 0.2 setosa
2 archive_read(data.tar.gz)[2] 4.9 3 1.4 0.2 setosa
3 archive_read(data.tar.gz)[2] 4.7 3.2 1.3 0.2 setosa
4 archive_read(data.tar.gz)[2] 4.6 3.1 1.5 0.2 setosa
5 archive_read(data.tar.gz)[2] 5 3.6 1.4 0.2 setosa
6 archive_read(data.tar.gz)[2] 5.4 3.9 1.7 0.4 setosa
7 archive_read(data.tar.gz)[2] 4.6 3.4 1.4 0.3 setosa
8 archive_read(data.tar.gz)[2] 5 3.4 1.5 0.2 setosa
9 archive_read(data.tar.gz)[2] 4.4 2.9 1.4 0.2 setosa
10 archive_read(data.tar.gz)[2] 4.9 3.1 1.5 0.1 setosa
# … with 140 more rows, and abbreviated variable names ¹Petal.Length,
# ²Petal.Width
# ℹ Use `print(n = ...)` to see more rows
Building off of this, I could extract the contents of the archive and then step through each of the files with map_dfr...
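A minimal sketch of that approach, for reference. The helper name read_archive_csvs and the ".csv" filter are my own assumptions for illustration; it relies only on archive(), archive_read(), read_csv() and map_dfr():

```r
library(archive)
library(readr)
library(purrr)

# Hypothetical helper: read every CSV entry of an archive into one tibble.
read_archive_csvs <- function(path) {
  # archive() lists the entries (path, size, date) without extracting them.
  entries <- archive(path)

  # Keep only the CSV files; this also drops the directory entry.
  csv_paths <- entries$path[grepl("\\.csv$", entries$path)]

  # Open each entry by name and record its path in a `file` column.
  map_dfr(
    set_names(csv_paths),
    function(p) read_csv(archive_read(path, file = p), show_col_types = FALSE),
    .id = "file"
  )
}

# Usage, assuming data.tar.gz is in the working directory:
# all_rows <- read_archive_csvs("data.tar.gz")
```

This opens the archive once to list it and then once per file, which is the overhead I'd like to avoid with ~100k entries.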
Is there an easier way to read everything in from the archive without the map_dfr step and without the extra overhead of calling both archive and archive_read?