r-lib / archive

R bindings to libarchive, supporting a large variety of archive formats
https://archive.r-lib.org/

archive_extract keeps working after extracting the desired files from remote .zip #85

Closed pakom closed 7 months ago

pakom commented 1 year ago

I want to extract specific files from a .zip file on the web. The archive is almost 1 GB and contains multiple folders. The files I am interested in are in the folder TIMSS2019_IDB_SPSS_G8/Data/. Here is the code I use:

library(archive)

d <- tempdir()

options(timeout = 50000000000)

archive_extract(
  archive = "https://www.iea.nl/sites/default/files/data-repository/TIMSS/TIMSS2019/TIMSS2019_IDB_SPSS_G8.zip",
  dir = d,
  files = c(
    "TIMSS2019_IDB_SPSS_G8/Data/bcgarem7.sav",
    "TIMSS2019_IDB_SPSS_G8/Data/bcgchlz7.sav"
  )
)

The two files are downloaded instantaneously, as they are below 150K. However, archive_extract keeps working for nearly seven minutes, which is the time I would need to download the entire .zip file. It does not matter whether I want to extract just 2 or 200 files; archive_extract just keeps working. I can't figure out why, but it looks to me as if archive_extract keeps looking for more files with the same names. Is there any way to make it exit once the desired files have been extracted?

I observe exactly the same behavior when I just try to list the files in the .zip using the archive function: it provides the list of files, but then keeps working.

In addition, after archive_extract finally exits, R issues the following warning:

Warning message:
In file(archive, "rb") : NAs introduced by coercion to integer range
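
A possible reading of this warning, offered as an assumption rather than a confirmed diagnosis: the timeout value set above is far larger than the largest integer R can represent, so converting it to an integer produces NA with exactly this message. The sketch below only demonstrates that conversion; `requested_timeout` and the 3600-second replacement are illustrative choices, not values from the thread.

```r
# The value passed to options(timeout = ...) in the report above.
requested_timeout <- 50000000000

# R integers are 32-bit; anything above this limit cannot be stored as one.
exceeds_integer_range <- requested_timeout > .Machine$integer.max

# as.integer() returns NA for out-of-range values and warns
# "NAs introduced by coercion to integer range" (suppressed here).
converted <- suppressWarnings(as.integer(requested_timeout))

print(exceeds_integer_range)
print(is.na(converted))

# Assumption: a large but representable timeout (one hour here) avoids the
# coercion problem while still allowing a slow download to finish.
options(timeout = 3600)
```

This concerns only the warning; it does not explain why extraction keeps running after the requested files are written.
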

gaborcsardi commented 1 year ago

Yeah, it should stop, if all specified files are extracted and they are all files (i.e. not directories).
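
The stopping rule described here can be sketched in plain R. This is a minimal illustration over a vector of entry names, not the package's actual implementation or the change made in the pull request; `entries_read_until_done` and the sample entry names are hypothetical, and treating a trailing slash as a directory request is an assumption made for the example.

```r
# Hypothetical helper: walks entry names in archive order and returns how many
# entries must be read before every requested file has been seen.
entries_read_until_done <- function(entry_names, wanted) {
  # Assumption: a trailing slash marks a directory request. Its contents could
  # appear anywhere in the archive, so the whole archive has to be scanned.
  if (any(grepl("/$", wanted))) {
    return(length(entry_names))
  }
  remaining <- wanted
  for (i in seq_along(entry_names)) {
    # Tick off the current entry if it is one of the requested files.
    remaining <- setdiff(remaining, entry_names[i])
    if (length(remaining) == 0) {
      return(i)  # stop early: everything requested has been found
    }
  }
  length(entry_names)  # at least one requested file was never found
}

# Made-up archive listing for illustration only.
entry_names <- c("a/", "a/x.sav", "a/y.sav", "a/z.sav", "b/big.bin")

files_only <- entries_read_until_done(entry_names, c("a/x.sav", "a/y.sav"))
with_directory <- entries_read_until_done(entry_names, "a/")
print(files_only)
print(with_directory)
```

With only plain files requested, the loop can return as soon as the last of them is seen; once a directory is requested, the full listing has to be read.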

pakom commented 1 year ago

Thank you for your reply, Gabor. Yes, they are all files. The archive function behaves the same way when listing the files in an archive.

pakom commented 1 year ago

Addressed by pull request #94

gaborcsardi commented 7 months ago

Closed by #94.