saulpw / unzip-http

Extract individual files from .zip files over http without downloading the entire archive.
MIT License

offset out of range for 65536-byte buffer #14

Closed · sergeyvilov closed this 2 months ago

sergeyvilov commented 1 year ago

While attempting to download files from an ultra-large zip archive (355 GB), I got the following error:

Traceback (most recent call last):
  File "/Users/sergey.vilov/tmp/test/test.py", line 5, in <module>
    binfp = rzf.open('train_images/10005/18667/100.dcm')
  File "/Users/sergey.vilov/miniconda/envs/kaggle/lib/python3.9/site-packages/unzip_http.py", line 192, in open
    f = list(self.matching_files(fn))
  File "/Users/sergey.vilov/miniconda/envs/kaggle/lib/python3.9/site-packages/unzip_http.py", line 186, in matching_files
    for f in self.files.values():
  File "/Users/sergey.vilov/miniconda/envs/kaggle/lib/python3.9/site-packages/unzip_http.py", line 109, in files
    self._files = {r.filename:r for r in self.infoiter()}
  File "/Users/sergey.vilov/miniconda/envs/kaggle/lib/python3.9/site-packages/unzip_http.py", line 109, in <dictcomp>
    self._files = {r.filename:r for r in self.infoiter()}
  File "/Users/sergey.vilov/miniconda/envs/kaggle/lib/python3.9/site-packages/unzip_http.py", line 151, in infoiter
    struct.unpack_from(self.fmt_cdirentry, resp.data, offset=filehdr_index)
struct.error: offset -138557274 out of range for 65536-byte buffer

The archive link can be obtained by downloading a Kaggle dataset from here. Unfortunately, I can't provide a direct link without exposing my Kaggle credentials.

saulpw commented 1 year ago

Thanks for the report, @sergeyvilov. I tried to get a URL to this dataset using my own kaggle account, but it seems like you either have to use their library/API, or save cookies from the browser session and use wget/curl. How did you manage to get a simple URL to use with unzip-http?

sergeyvilov commented 1 year ago

Hi @saulpw, and thanks for the fast reply. I just clicked the Download All button under the Data Explorer on the Data tab, then cancelled the download and chose Copy download link by clicking on the cancelled download in Firefox's download window. The link starts with https://storage.googleapis.com/kaggle-... and ends with .zip. It then works with wget or curl without cookies.

The problem with kaggle-api is that it can't download directories, and when one tries to download the files one by one, one eventually gets a 'Too many requests' error (this dataset contains a huge number of small .dcm files, which triggers it). I thought unzip-http might be a way around this.

If unzip-http uses a separate HTTP request per file, then one may run into the same 'Too many requests' issue. If unzip-http could read a whole directory (or several) with a single request, provided it's contiguous in the zip archive, or at least with a minimal number of requests, that would be awesome!
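
Roughly something like this, perhaps (just a sketch of the idea, not unzip-http's current API; it assumes the member names, local-header offsets, compressed sizes, and compression methods have already been read from the central directory, that the wanted members sit back to back in the archive, and that they are stored or deflated):

import struct
import urllib.request
import zlib

def fetch_range(url, start, end):
    # One HTTP request covering every wanted member (byte range is inclusive).
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def split_members(blob, members, range_start):
    # members: list of (name, local_header_offset, compressed_size, method)
    # as read from the central directory.  Each member starts with a 30-byte
    # local file header whose last two fields give the filename/extra lengths,
    # so the compressed data can be located and inflated locally.
    out = {}
    for name, hdr_off, csize, method in members:
        i = hdr_off - range_start
        fn_len, extra_len = struct.unpack_from('<HH', blob, i + 26)
        data_start = i + 30 + fn_len + extra_len
        data = blob[data_start:data_start + csize]
        out[name] = data if method == 0 else zlib.decompress(data, -15)  # 0 = stored, 8 = deflate
    return out

That way there would be one Range request per contiguous run of members instead of one per member, which should keep the request count low even for directories with thousands of small .dcm files.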

saulpw commented 1 year ago

Thanks, that allowed me to get the URL too. When I use this code it works fine for me:

import unzip_http

rzf = unzip_http.RemoteZipFile(URL_FROM_KAGGLE)
binfp = rzf.open('train_images/10005/18667/100.dcm')
binfp.read()

So I'm not sure why you're seeing that error. We did fix an issue like this a while ago for 64-bit (zip64) .zip files. Are you using the most recent version of unzip_http?
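
(For reference, here is a minimal sketch, not unzip_http's actual code, of how an archive this size can produce a negative offset like the one in your traceback: a 4-byte offset field simply cannot represent positions in a >4 GiB file.)

import struct

# In a classic end-of-central-directory record, the "offset of start of
# central directory" is a 4-byte field.  Archives over 4 GiB store the
# sentinel 0xFFFFFFFF there and put the real 8-byte offset in the zip64
# EOCD record instead; trusting the 4-byte value gives nonsense positions.
archive_size = 355 * 10**9             # ~355 GB archive
tail_size = 65536                      # only the last 64 KiB are downloaded
tail_start = archive_size - tail_size  # absolute position where that buffer begins

cdir_offset_32bit = 0xFFFFFFFF         # zip64 sentinel, not a real offset
buffer_index = cdir_offset_32bit - tail_start
print(buffer_index)                    # large negative number

# Unpacking at that index fails with
# "struct.error: offset ... out of range for 65536-byte buffer":
struct.unpack_from('<I', bytes(tail_size), offset=buffer_index)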

You're right, though, that unzip-http does a separate HTTP request per file. It wouldn't be impossible to make unzip_http download multiple contiguous files with one request, as you suggest, but it would be somewhat complicated, and I unfortunately don't have the time at the moment to pull it together. If you're interested in making it happen, I'd certainly review a PR for it.

sergeyvilov commented 1 year ago

Thank you very much for testing.

Looks very strange. I'm using a clean conda environment with unzip-http 0.4 and Python 3.11 (also tried 3.8) and executing the same code as you; the issue shows up on macOS Big Sur (my home laptop), Ubuntu 18.04.6, and Rocky Linux 8.8 (remote servers). Again, wget with the same link works fine.

Concerning the number of requests, I think it might be useful to merge individual requests for contiguous regions in future releases. I'm pretty sure Kaggle isn't the only server that limits the request rate, so many users may run into this issue when trying to download many files from an archive.

Ulipenitz commented 7 months ago

I get the same error: offset -25009427 out of range for 65536-byte buffer

I am trying to download parts of DocLayNet_extra.zip from here: https://developer.ibm.com/exchanges/data/all/doclaynet/ (I extracted the actual URL from the page's HTML).

import unzip_http

rzf = unzip_http.RemoteZipFile("https://codait-cos-dax.s3.us.cloud-object-storage.appdomain.cloud/dax-doclaynet/1.0.0/DocLayNet_extra.zip")
binfp = rzf.open('PDF/876c27352fc096c0572aa1141dd5e4465fec098b31d67c42df5ba955709b4979.pdf') # or 'DocLayNet_extra/PDF/....' not sure how to use the library yet
binfp.read()

dmetivie commented 5 months ago

Same here with

zip_url = 'https://knmi-ecad-assets-prd.s3.amazonaws.com/download/ECA_blend_rr.zip'
filename_in_zip = 'RR_STAID000031.txt'

Would it be enough to change the 65536 to a larger number (I have no idea if that makes sense) when needed?

saulpw commented 2 months ago

So, I tried all these cases myself from the CLI and they seem to work fine with v0.5.1. For example:

$ ./unzip-http https://knmi-ecad-assets-prd.s3.amazonaws.com/download/ECA_blend_rr.zip RR_STAID000031.txt                                                                                  
Extracting RR_STAID000031.txt to RR_STAID000031.txt...
0s  0.26/0.18MB  (20.41 MB/s)  RR_STAID000031.txt

I think the problem is that we never pushed v0.5.1 to PyPI. So these problems should be fixed once we do that (hopefully tonight).
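
In the meantime, installing straight from the repo should already give you v0.5.1 (assuming the usual pip install from git works for this repo):

$ pip install --upgrade git+https://github.com/saulpw/unzip-http.git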

If any errors like this still happen with v0.5.1, please open a new issue.