r-lib / archive

R bindings to libarchive, supporting a large variety of archive formats
https://archive.r-lib.org/

archive_extract fails on 7-zip archives > 2 GB #81

Open ryanpcole opened 1 year ago

ryanpcole commented 1 year ago

I am trying to programmatically extract raster files downloaded from USGS. They are in 7z format. Strangely, I am unable to extract the 7z archives if they are greater than ~2 GB. Here's an example demonstrating the error I get when I try downloading and extracting a 2.5 GB archive. I've also included links to the archives from USGS.

library(archive)

small_7z_link <- "https://prd-tnm.s3.amazonaws.com/StagedProducts/Hydrography/NHDPlusHR/Beta/GDB/NHDPLUS_H_1708_HU4_RASTER.7z"
large_7z_link <- "https://prd-tnm.s3.amazonaws.com/StagedProducts/Hydrography/NHDPlusHR/Beta/GDB/NHDPLUS_H_1709_HU4_RASTER.7z"

# Download files - these are about 4GB in total
download.file(small_7z_link,
              destfile = "small_raster.7z",
              mode = "wb")
download.file(large_7z_link,
              destfile = "large_raster.7z",
              mode = "wb")

# Attempt to access archive information
archive("small_raster.7z")
#> # A tibble: 105 × 3
#>    path                                                size date               
#>    <chr>                                              <int> <dttm>             
#>  1 HRNHDPlusRasters1708/elev_source.gdb/gdb               4 2018-05-21 08:20:13
#>  2 HRNHDPlusRasters1708/elev_source.gdb/timestamps      400 2018-05-21 08:32:57
#>  3 HRNHDPlusRasters1708/shdrelief.jp2              61676301 2018-05-21 08:31:47
#>  4 HRNHDPlusRasters1708/cat.tif.aux.xml                2729 2018-06-20 07:44:18
#>  5 HRNHDPlusRasters1708/cat.tif.xml                    5608 2018-06-20 07:44:18
#>  6 HRNHDPlusRasters1708/catseed.tif.aux.xml            2460 2018-06-20 07:40:42
#>  7 HRNHDPlusRasters1708/catseed.tif.xml                5867 2018-06-20 07:40:42
#>  8 HRNHDPlusRasters1708/elev_cm.tif.aux.xml            2558 2018-06-12 08:01:59
#>  9 HRNHDPlusRasters1708/elev_cm.tif.xml                2591 2018-05-21 08:32:58
#> 10 HRNHDPlusRasters1708/fac.tif.aux.xml                1644 2018-06-20 10:23:14
#> # … with 95 more rows

archive("large_raster.7z")
#> # A tibble: 0 × 3
#> # … with 3 variables: path <chr>, size <int>, date <dttm>

# Attempt to extract files

# Succeeds
archive_extract("small_raster.7z")
#> ⠙ 2 extracted | 53 MB ( 25 MB/s) | 2.1s ⠹ 2 extracted | 58 MB ( 25 MB/s) |
#> 2.3s ⠸ 26 extracted | 72 MB ( 28 MB/s) | 2.5s ⠼ 80 extracted | 78 MB ( 28... 
# (I truncated the output for readability)

# Fails
archive_extract("large_raster.7z")
#> Error: archive_extract.cpp:166 archive_read_next_header(): Truncated 7-Zip file body

I don't know what the "Truncated 7-Zip file body" error means. I am able to extract both archives correctly using 7-Zip itself, so they don't appear to be corrupted.

These are multi-file archives, so each archive contains multiple raster files. Interestingly, I am able to use archive_extract if I delete some of the files contained in the larger archive to make it less than 2 GB in size. I've tried this on many other raster archives from USGS, and extracting the archive always fails if the archive size is > ~2 GB. Any ideas what might be going on here? Thanks!
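A threshold of about 2 GB is close to the largest value a 32-bit signed integer can hold (2,147,483,647 bytes), and the `size` column in the listing above is typed `<int>`. Whether an integer overflow is really the cause is only a guess and is not confirmed anywhere in this thread. The sketch below just checks which downloaded files cross that limit; `exceeds_int32` is a hypothetical helper, not part of the archive package.

```r
# Hypothetical helper (not part of the archive package): report whether a
# file's size exceeds the largest value a 32-bit signed integer can hold.
exceeds_int32 <- function(path) {
  size <- file.size(path)        # returned as a double, so it cannot overflow
  size > .Machine$integer.max    # 2147483647 bytes, just under 2 GiB
}

# File names taken from the example above; skip any that were not downloaded.
files <- c("small_raster.7z", "large_raster.7z")
files <- files[file.exists(files)]

data.frame(
  file       = files,
  size_bytes = file.size(files),
  over_int32 = exceeds_int32(files)
)
```

If only the archives flagged in `over_int32` fail to extract, that would be consistent with the guess above, but it would not prove it.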

robchallen commented 1 year ago

I'm facing the same issue. Is there any workaround?
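One possible stopgap, based on the original report that the 7-Zip program itself extracts these archives without trouble, is to call the 7-Zip command-line tool from R instead of `archive_extract()`. This is a minimal sketch, not a fix for the package: `extract_with_7z` is a hypothetical helper, and it assumes a 7-Zip executable (`7z`, `7za`, or `7zz`) is installed and on the `PATH`.

```r
# Hypothetical helper: extract an archive by shelling out to the 7-Zip CLI.
# Assumes a 7-Zip executable named 7z, 7za, or 7zz is available on the PATH.
extract_with_7z <- function(archive_path, dir = ".") {
  exe <- Sys.which(c("7z", "7za", "7zz"))
  exe <- exe[nzchar(exe)]
  if (length(exe) == 0) {
    stop("No 7-Zip command-line executable found on PATH")
  }
  # "x" extracts with full paths, "-o<dir>" sets the output directory,
  # and "-y" answers yes to any prompts so the call does not block.
  args <- c("x", shQuote(archive_path), paste0("-o", shQuote(dir)), "-y")
  status <- system2(exe[[1]], args)
  if (status != 0) {
    stop("7-Zip exited with status ", status)
  }
  invisible(dir)
}

# Example call (commented out because it needs the downloaded archive):
# extract_with_7z("large_raster.7z", dir = "large_raster")
```

Shelling out adds an external dependency, so it only makes sense as a temporary measure until the package handles large archives directly.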

cielavenir commented 1 year ago

This should be fixed by https://github.com/r-lib/archive/pull/87, but it is kind of a hack.