r-lib / zip

Platform independent zip compression via miniz
https://r-lib.github.io/zip/

Cannot list or extract from large ZIP file #75

Closed k5cents closed 1 year ago

k5cents commented 3 years ago

I cannot open this large ZIP file to list the files inside. Listing does work with utils::unzip(list = TRUE), using the /usr/bin/unzip internal method on Ubuntu 20.04. Each file in the ~3GB archive should be around 1GB, for a total of ~24GB.

library(fs)
library(zip)
z <- file_temp(ext = "zip")
download.file(
  # warning: large file
  url = "https://files.usaspending.gov/award_data_archive/FY2020_All_Assistance_Full_20210308.zip",
  destfile = z,
  method = "curl"
)
file_size(z)
#> 3.14G
zip::zip_list(z)
#> Error in zip::zip_list(z) : 
#>   Cannot open zip file `/tmp/RtmpiS1Via/file1f413a5dd02c.zip`
utils::unzip(z, list = TRUE)
#>                                          Name     Length                Date
#> 1   FY2020_All_Assistance_Full_20210309_1.csv 1024502311 2021-03-09 05:50:00
#> 2   FY2020_All_Assistance_Full_20210309_2.csv 1024063334 2021-03-09 05:51:00
#> 3   FY2020_All_Assistance_Full_20210309_3.csv 1024246425 2021-03-09 05:51:00
#> 4   FY2020_All_Assistance_Full_20210309_4.csv 1024336940 2021-03-09 05:52:00
#> 5   FY2020_All_Assistance_Full_20210309_5.csv 1023755376 2021-03-09 05:53:00
#> 6   FY2020_All_Assistance_Full_20210309_6.csv 1023542726 2021-03-09 05:53:00
#> 7   FY2020_All_Assistance_Full_20210309_7.csv 1024207502 2021-03-09 05:54:00
#> 8   FY2020_All_Assistance_Full_20210309_8.csv 1022310586 2021-03-09 05:54:00
#> 9   FY2020_All_Assistance_Full_20210309_9.csv 1021702096 2021-03-09 05:55:00
#> 10 FY2020_All_Assistance_Full_20210309_10.csv 1021326140 2021-03-09 05:55:00
#> 11 FY2020_All_Assistance_Full_20210309_11.csv 1022692916 2021-03-09 05:56:00
#> 12 FY2020_All_Assistance_Full_20210309_12.csv 1021422379 2021-03-09 05:56:00
#> 13 FY2020_All_Assistance_Full_20210309_13.csv 1021028890 2021-03-09 05:57:00
#> 14 FY2020_All_Assistance_Full_20210309_14.csv 1020755288 2021-03-09 05:57:00
#> 15 FY2020_All_Assistance_Full_20210309_15.csv 1021372686 2021-03-09 05:58:00
#> 16 FY2020_All_Assistance_Full_20210309_16.csv 1021779105 2021-03-09 05:59:00
#> 17 FY2020_All_Assistance_Full_20210309_17.csv 1021351276 2021-03-09 05:59:00
#> 18 FY2020_All_Assistance_Full_20210309_18.csv 1022206279 2021-03-09 06:00:00
#> 19 FY2020_All_Assistance_Full_20210309_19.csv 1021372553 2021-03-09 06:00:00
#> 20 FY2020_All_Assistance_Full_20210309_20.csv 1021164581 2021-03-09 06:01:00
#> 21 FY2020_All_Assistance_Full_20210309_21.csv 1021823928 2021-03-09 06:01:00
#> 22 FY2020_All_Assistance_Full_20210309_22.csv 1020997871 2021-03-09 06:02:00
#> 23 FY2020_All_Assistance_Full_20210309_23.csv 1021256978 2021-03-09 06:02:00
#> 24 FY2020_All_Assistance_Full_20210309_24.csv 1021735986 2021-03-09 06:03:00
#> 25 FY2020_All_Assistance_Full_20210309_25.csv  356903330 2021-03-09 06:03:00

I also have a problem when extracting the files using either function, although using system2() to invoke the unzip command manually seems to work for me.

o <- utils::unzip(z, exdir = dirname(z))
#> Warning message:
#> In utils::unzip(z, exdir = dirname(z)) : zip file is corrupt
o <- zip::unzip(z, exdir = dirname(z))
#> Error in zip::unzip(z, exdir = dirname(z)) : 
#>   zip error: `Cannot open zip file `/tmp/RtmpiS1Via/file1f413a5dd02c.zip` for reading` in file `zip.c:140`

k5cents commented 3 years ago

Apologies, I see #65 now. I tried installing v2.0.4, per a comment there, and still get the same problem.

awd97 commented 3 years ago

How many files are there in this ZIP file? I'm appending files to an existing zip file, and I notice that it won't open once the number of files in the archive reaches 65535, which might be the actual issue.

k5cents commented 3 years ago

@awd97 Nowhere near that many. Only 25, but each is about 1GB in size.
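
For reference, the 65535 figure mentioned above matches the classic ZIP format's 16-bit entry count; sizes and offsets in that format are 32-bit, so anything past 2^32 - 1 bytes needs the ZIP64 extension. Below is a minimal sketch of a rough check. The helper name exceeds_classic_zip is hypothetical (it is not part of zip or utils), and it assumes a listing data frame with a numeric Length column, such as the one utils::unzip(list = TRUE) returns. A writer may also emit ZIP64 records for an archive that stays under these limits, so an all-FALSE result does not rule ZIP64 out.

```r
# Hypothetical helper, not part of the zip or utils packages.
# Classic (non-ZIP64) ZIP limits: 16-bit entry count, 32-bit sizes.
exceeds_classic_zip <- function(listing, archive_bytes = NA_real_) {
  max_entries <- 65535
  max_bytes <- 2^32 - 1
  c(
    # More entries than a 16-bit counter can hold
    too_many_entries = nrow(listing) > max_entries,
    # Any single uncompressed entry larger than a 32-bit size field
    entry_too_large = any(listing$Length > max_bytes),
    # Archive itself larger than a 32-bit offset can address
    # (NA, the default, is treated as "unknown" and reported as FALSE)
    archive_too_large = isTRUE(archive_bytes > max_bytes)
  )
}

# Usage, assuming `z` is the path to an archive that utils::unzip can list:
# exceeds_classic_zip(utils::unzip(z, list = TRUE), file.size(z))
```

For the listing shown earlier in this thread (25 entries of about 1GB each in a 3.14G archive), all three checks come back FALSE, which fits the reply above that the entry count is not the problem here.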

RodrigoZepeda commented 2 years ago

I'm having the same issue, and my zip contains only one file:

library(zip)

file_download_data <- tempfile()

# Download dataset ----
site.covid <- paste0(
  "http://datosabiertos.salud.gob.mx/gobmx/salud",
  "/datos_abiertos/datos_abiertos_covid19.zip"
)

download.file(site.covid, file_download_data, method = "curl")

zip::unzip(file_download_data)

This results in

Error in zip::unzip(file_download_data) : 
  zip error: `Cannot extract entry `220411COVID19MEXICO.csv` from archive `/tmp/RtmpKnsWCA/file133dc5188912a`` in file `zip.c:219`

Doing

system2("unzip", args = c("-o", file_download_data))

works fine.

Attaching my session:

> sessionInfo()

    R version 4.1.3 (2022-03-10)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Ubuntu 20.04.4 LTS

    Matrix products: default
    BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
    LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=es_MX.UTF-8       
     [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=es_MX.UTF-8    LC_MESSAGES=en_US.UTF-8   
     [7] LC_PAPER=es_MX.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
    [10] LC_TELEPHONE=C             LC_MEASUREMENT=es_MX.UTF-8 LC_IDENTIFICATION=C       

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     

    other attached packages:
    [1] zip_2.2.0.9000 RMariaDB_1.2.1

    loaded via a namespace (and not attached):
     [1] bit_4.0.4       compiler_4.1.3  ellipsis_0.3.2  cli_3.2.0       hms_1.1.1      
     [6] DBI_1.1.2       tools_4.1.3     Rcpp_1.0.8.3    bit64_4.0.5     vctrs_0.4.0    
    [11] lifecycle_1.0.1 pkgconfig_2.0.3 rlang_1.0.2    

jimjam-slam commented 1 year ago

I'm having a similar problem extracting zip files like https://climatedata-beta.environment.nsw.gov.au/download-collection/ae2c99ac-5ef1-44ef-abf9-10d63082f739 (about 3.93 GB to download, with 569 NetCDF .nc files expected inside).

This is on an M1 Mac:

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] zip_2.2.2

loaded via a namespace (and not attached):
[1] compiler_4.1.2

I should also mention that getOption("unzip") reports "/usr/bin/unzip", the same path that which unzip gives me. I'm not sure why unzipping in bash gives different results from unzipping with utils::unzip.

EDIT: my mistake! utils::unzip defaults to unzip = "internal", not getOption("unzip"). Using utils::unzip(file, unzip = getOption("unzip")) works for me!
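
The workaround in the edit above can be wrapped up as a small sketch. The names pick_unzip_method and unzip_large are hypothetical helpers, not part of utils or zip; the sketch assumes getOption("unzip") either names an external unzip program or is "internal" or empty, and falls back to the internal method when no usable program is found.

```r
# Hypothetical helpers, not part of the utils or zip packages.

# Return the external unzip program from getOption("unzip") if it can be
# found on this system; otherwise fall back to R's "internal" method.
pick_unzip_method <- function() {
  exe <- getOption("unzip")
  usable <- is.character(exe) && length(exe) == 1 && !is.na(exe) &&
    nzchar(exe) && nzchar(Sys.which(exe))
  if (usable) exe else "internal"
}

# List or extract an archive, preferring the external program because
# utils::unzip() defaults to unzip = "internal".
unzip_large <- function(zipfile, exdir = ".", list = FALSE) {
  utils::unzip(zipfile, list = list, exdir = exdir,
               unzip = pick_unzip_method())
}

# Usage, assuming `z` is the path to a downloaded archive:
# unzip_large(z, list = TRUE)
# unzip_large(z, exdir = dirname(z))
```

This keeps the call inside R rather than going through system2(), at the cost of depending on whatever unzip program is installed on the machine.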

gaborcsardi commented 1 year ago

Fixed by #79.