open-contracting / kingfisher-collect

Downloads OCDS data and stores it on disk
https://kingfisher-collect.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Add support for max_bytes option to crawl/crawlall (implemented in pluck) #312

Open jpmckinney opened 4 years ago

jpmckinney commented 4 years ago

Presently, if a publisher only offers bulk downloads and we only want a sample, we still need to download the entire bulk file, which can be large, as in the case of Digiwhist's (opentender.eu's) files.

It's possible to gunzip a partial file (and I think it's also possible to untar a partial file). The only question is whether we can either: (1) ask Scrapy to stop downloading after a given number of bytes, or (2) stream the response to a callback that can close the connection once it has read enough bytes.
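
For illustration only (not project code), the point about partial gunzipping can be shown with the standard library: a gzip stream decompresses incrementally, so even a truncated download yields the leading bytes of the original file. The sample data and sizes below are made up.

```python
import gzip
import json
import zlib

# Some compressible JSON standing in for a bulk OCDS download.
original = json.dumps({"releases": [{"id": str(i)} for i in range(100000)]}).encode()
compressed = gzip.compress(original)

# Pretend the download was cut off halfway through the .gz file.
partial = compressed[: len(compressed) // 2]

# wbits=16 + zlib.MAX_WBITS tells zlib to expect a gzip header.
decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
prefix = decompressor.decompress(partial)

# A usable prefix of the original JSON, without the full file.
print(len(prefix), prefix[:30])
```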

jpmckinney commented 4 years ago

Relevant to downloading large archive files like Digiwhist's, which I think are presently loaded into memory.

jpmckinney commented 4 years ago

The only question is whether we can either: (1) ask Scrapy to stop downloading after a given number of bytes, or (2) stream the response to a callback that can close the connection once it has read enough bytes.

Scrapy 2.2 adds a way to stop a download: https://docs.scrapy.org/en/latest/topics/request-response.html#topics-stop-response-download
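
A minimal sketch of how that Scrapy 2.2+ mechanism could be used (this is not Kingfisher Collect's actual implementation; the spider name, URL and 10 MB limit are placeholders): a bytes_received signal handler raises StopDownload(fail=False) once enough bytes have arrived, and the partially downloaded body is still passed to the callback.

```python
import scrapy
from scrapy import signals
from scrapy.exceptions import StopDownload


class MaxBytesSpider(scrapy.Spider):
    name = "max_bytes_example"  # placeholder
    start_urls = ["https://example.com/huge-bulk-download.json.gz"]  # placeholder
    max_bytes = 10_000_000  # hypothetical limit

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        # Connect the handler to the bytes_received signal (added in Scrapy 2.2).
        crawler.signals.connect(spider.on_bytes_received, signal=signals.bytes_received)
        return spider

    def on_bytes_received(self, data, request, spider):
        # Track bytes per request in request.meta.
        received = request.meta.get("bytes_received", 0) + len(data)
        request.meta["bytes_received"] = received
        if received >= self.max_bytes:
            # fail=False delivers the partial body to the request's callback.
            raise StopDownload(fail=False)

    def parse(self, response):
        self.logger.info("received %d bytes", len(response.body))
```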

jpmckinney commented 3 years ago

When running pluck, I noticed that memory usage grows and shrinks a fair bit (by GBs). In #629, I implemented a solution to limit the bytes downloaded for non-compressed files and tar.gz files.

Update: I think the remaining peaks in memory usage are due to: (1) simultaneous processing and (2) large ZIPs, like georgia_opendata:

scrapy pluck --logfile pluck.log --loglevel=WARN --max-bytes 10000 --package-pointer /license georgia_opendata

jpmckinney commented 3 years ago

With respect to closing this issue, we can consider allowing max-bytes to be specified in the crawlall command (which is designed for downloading samples), and maybe also in the crawl command (as a spider argument).
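
For context, a Scrapy spider argument is passed with -a on the command line and arrives as a constructor keyword argument (always a string). A sketch of how a hypothetical max_bytes spider argument could be accepted, under the assumption that the spider name and argument name below are placeholders:

```python
# Invocation would look like:
#   scrapy crawl georgia_opendata -a max_bytes=10000
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"  # placeholder

    def __init__(self, max_bytes=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Spider arguments arrive as strings, so cast before comparing byte counts.
        self.max_bytes = int(max_bytes) if max_bytes is not None else None
```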

yolile commented 3 years ago

(2) large ZIPs, like georgia_opendata

Could we use resize_package for these cases? (I haven't downloaded the file yet, so I don't know whether it is one big JSON file or many small ones.)

jpmckinney commented 3 years ago

I think the challenge is that to read even part of a ZIP file, the whole file must be downloaded, because the central directory is stored at the end of the archive. Since we hold the ZIP file in memory (to avoid writing to disk, which is a blocking operation), memory consumption will always be high for big ZIPs. It hasn't been a problem so far, since the ZIPs are never too big.
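
An illustration of that constraint (not project code, with made-up contents): a truncated ZIP cannot be opened at all, because the central directory at the end of the archive is missing, whereas a truncated gzip stream still yields a usable prefix (as in the earlier sketch).

```python
import io
import zipfile

# Build a small in-memory ZIP archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("release_package.json", '{"releases": []}' * 10000)

complete = buffer.getvalue()
truncated = complete[: len(complete) // 2]

# The truncated archive has no end-of-central-directory record, so it is unreadable.
try:
    zipfile.ZipFile(io.BytesIO(truncated)).namelist()
except zipfile.BadZipFile as error:
    print("truncated ZIP is unreadable:", error)

# Only the complete archive (held entirely in memory here) can be read.
with zipfile.ZipFile(io.BytesIO(complete)) as archive:
    print(archive.namelist())
```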