jpmckinney opened 4 years ago
Relevant to downloading large archive files like Digiwhist, which I think are presently loaded into memory.
The only question is whether we can either: (1) request Scrapy to stop downloading after a given number of bytes or (2) stream the response to a callback that can then close the connection once it's read enough bytes.
Scrapy 2.2 adds a way to stop a download: https://docs.scrapy.org/en/latest/topics/request-response.html#topics-stop-response-download
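A minimal sketch of how that hook could cover option (1). Note this is an assumption-laden sketch, not the project's implementation: the `StopDownload` class below is a local stand-in for `scrapy.exceptions.StopDownload` (Scrapy >= 2.2) so the snippet is self-contained, and the returned handler would be connected to Scrapy's `bytes_received` signal.

```python
from collections import defaultdict

class StopDownload(Exception):
    """Local stand-in for scrapy.exceptions.StopDownload (Scrapy >= 2.2)."""
    def __init__(self, fail=True):
        self.fail = fail

def make_bytes_received_handler(max_bytes):
    """Build a handler for the bytes_received signal that stops a download
    once max_bytes have arrived for a given request."""
    received = defaultdict(int)  # bytes seen so far, per request

    def handler(data, request, spider):
        received[request] += len(data)
        if received[request] >= max_bytes:
            # fail=False delivers the partial response body to the request's
            # callback instead of its errback.
            raise StopDownload(fail=False)

    return handler
```

The per-request counter matters because one handler instance can see interleaved chunks from concurrent downloads.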
When running pluck, I noticed that memory usage grows and shrinks a fair bit (GBs). In #629, I implemented a solution to limit the bytes downloaded of non-compressed files and tar.gz files.
Update: I think the remaining peaks in memory usage are due to: (1) simultaneous processing and (2) large ZIPs, like georgia_opendata:
scrapy pluck --logfile pluck.log --loglevel=WARN --max-bytes 10000 --package-pointer /license georgia_opendata
With respect to closing this issue, we can consider allowing max-bytes to be specified in the crawlall command (which is designed for downloading samples), and maybe also in the crawl command (as a spider argument).
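If max-bytes does become a crawl spider argument, one detail is that values passed with -a arrive as strings. A sketch of the coercion, with a plain class standing in for a scrapy.Spider subclass (the class and spider name here are hypothetical):

```python
class SampleSpider:  # stand-in for a scrapy.Spider subclass
    name = "sample"

    def __init__(self, max_bytes=None, **kwargs):
        # `scrapy crawl sample -a max_bytes=10000` passes "10000" as a str,
        # so coerce it to an int; None means "no limit".
        self.max_bytes = int(max_bytes) if max_bytes is not None else None
```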
> (2) large ZIPs, like georgia_opendata

Could we use resize_package for these cases? (I haven't downloaded the file yet, so I don't know whether it is just one big JSON file or a lot of small ones.)
I think the challenge is that to read even a part of a ZIP file, the whole file must be downloaded first, because ZIP stores its central directory at the end of the archive. Since we hold the ZIP file in memory (to avoid writing to disk, which is a blocking operation), memory consumption will always be high for big ZIPs. It's not a problem so far, since the ZIPs are never too big.
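A quick standard-library demonstration of why a partially downloaded ZIP can't be read:

```python
import io
import zipfile

# Build a small ZIP in memory, then simulate a download stopped partway
# through by keeping only the first half of the bytes.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data.json", '{"key": "value"}' * 500)
full = buf.getvalue()
partial = io.BytesIO(full[: len(full) // 2])

try:
    # zipfile looks for the end-of-central-directory record, which is
    # missing from a truncated archive.
    zipfile.ZipFile(partial)
except zipfile.BadZipFile:
    print("partial ZIP is unreadable")
```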
Presently, if a publisher only offers bulk downloads, and we only want to download a sample, we still need to download the entire bulk file, which can be large, as in the case of Digiwhist's (opentender.eu) files.
It's possible to gunzip a partial file (and I think it's also possible to untar a partial file).
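By contrast with ZIP, gzip is a stream format, so the standard library can decompress however many bytes were actually downloaded. A small demonstration:

```python
import gzip
import zlib

# Compress some repetitive JSON-ish data, then keep only the first half,
# simulating a download stopped partway through.
full = gzip.compress(b'{"releases": []}' * 5000)
partial = full[: len(full) // 2]

# wbits=MAX_WBITS | 16 tells zlib to expect a gzip container.
decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
chunk = decompressor.decompress(partial)
print(len(chunk), "bytes recovered from a half-downloaded gzip file")
```

The recovered bytes are a prefix of the original data, which is what makes sampling from a partial bulk download feasible for gzipped files.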