leewesleyv opened 1 month ago
It seems like the file stores do not implement methods for downloading/retrieving a file. The closest method they have for obtaining information about a file (used in the pipelines) is `stat_file`, which retrieves metadata (checksum, last modified). For example, `S3FilesStore.stat_file` calls `s3_client.head_object`. From the docs:

> The HEAD operation retrieves metadata from an object without returning the object itself. This operation is useful if you're interested only in an object's metadata.

Similar logic is implemented for the other stores.
This means that we will be responsible for implementing this functionality ourselves. The only question that remains is whether we should extend the current stores, or approach it like we do now with some helper functions.
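As a rough sketch of the extend-the-stores option: a hypothetical `download_file` alongside `stat_file` on an S3-style store could look like the following. The class, constructor, and attribute names here are assumptions for illustration, not Scrapy's actual API; the client is any boto3-style S3 client.

```python
import io


class S3DownloadStore:
    """Hypothetical sketch of an S3FilesStore-like store that adds a
    download_file counterpart to stat_file. Names are illustrative,
    not part of Scrapy's actual API."""

    def __init__(self, s3_client, bucket, prefix=""):
        self.s3_client = s3_client  # boto3-style S3 client (assumed)
        self.bucket = bucket
        self.prefix = prefix

    def stat_file(self, path):
        # head_object returns metadata only (checksum, last modified),
        # without the object body -- this is what the stores do today.
        response = self.s3_client.head_object(
            Bucket=self.bucket, Key=self.prefix + path
        )
        return {
            "checksum": response["ETag"].strip('"'),
            "last_modified": response["LastModified"],
        }

    def download_file(self, path):
        # get_object returns the body alongside the metadata, which is
        # the piece the current stores are missing.
        response = self.s3_client.get_object(
            Bucket=self.bucket, Key=self.prefix + path
        )
        return response["Body"].read()
```

The helper-function alternative would keep the stores untouched and put the `get_object` call in a standalone function instead; the S3 calls themselves would be identical.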
Sounds like it makes sense to take the shortest route (in terms of effort needed) to get this working in a reasonable time. Eventually this would best be added to Scrapy itself, so other extensions/middlewares could make use of it too, so the work is useful either way.
Perhaps an order of things could be:

- `stat_file` and `download_file` (and the things we ended up on in the previous step) would be welcome in Scrapy, e.g. by opening an issue there to consider their inclusion.
Currently we create the clients for fetching files from cloud providers ourselves (in `utils.py`/`wacz.py`). Ideally, we want to re-use the functionality that Scrapy already has for this, to reduce the testing and maintenance burden it brings.