q-m / scrapy-webarchive

A plugin for Scrapy that allows users to capture and export web archives in the WARC and WACZ formats during crawling.
http://developers.thequestionmark.org/scrapy-webarchive/

Open WACZ files using the Scrapy stores #11

leewesleyv commented 1 month ago

Currently we create the clients for fetching files from cloud providers ourselves (in utils.py/wacz.py). Ideally, we want to re-use Scrapy's existing functionality for this, to reduce the complexity of testing and maintaining our own clients.
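
For context, here is a minimal sketch of the kind of client handling we do ourselves today (a hypothetical simplification, not the actual code in utils.py/wacz.py):

```python
import boto3
from urllib.parse import urlparse


def fetch_wacz_from_s3(uri: str) -> bytes:
    """Fetch a WACZ file by s3:// URI using a boto3 client we create
    and manage ourselves (the pattern we would like to replace)."""
    parsed = urlparse(uri)  # e.g. s3://my-bucket/archives/crawl.wacz
    client = boto3.client("s3")
    response = client.get_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))
    return response["Body"].read()
```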

leewesleyv commented 1 month ago

It seems like the file stores do not implement methods for downloading/retrieving a file. The closest they come to exposing file information (used in the pipelines) is stat_file, which retrieves metadata for a file (checksum, last modified). Looking at S3FilesStore.stat_file, for example, it calls s3_client.head_object. From the docs:

> The HEAD operation retrieves metadata from an object without returning the object itself. This operation is useful if you’re interested only in an object’s metadata.

Similar logic is implemented for the other stores.
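
To illustrate the gap in boto3/botocore terms (bucket and key names below are made up): stat_file maps onto head_object, which returns metadata only, while actually retrieving a file would need get_object:

```python
import boto3

s3 = boto3.client("s3")

# What stat_file builds on: a HEAD request that returns metadata only
# (ETag, LastModified, ContentLength), never the object body.
metadata = s3.head_object(Bucket="my-bucket", Key="archives/crawl.wacz")

# What the stores currently lack: a GET request for the body itself.
body = s3.get_object(Bucket="my-bucket", Key="archives/crawl.wacz")["Body"].read()
```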

This means that we will be responsible for implementing this functionality ourselves. The only question that remains is whether we should extend the current stores, or keep the current approach of standalone helper functions.
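
As a sketch of the first option, extending a store could look roughly like this (download_file is our own name, not an existing Scrapy method; this assumes S3FilesStore exposes bucket, prefix, and s3_client as in recent Scrapy versions, and ignores the Twisted deferred handling that stat_file uses):

```python
from scrapy.pipelines.files import S3FilesStore


class ReadableS3FilesStore(S3FilesStore):
    """S3FilesStore with a read path next to the existing stat_file."""

    def download_file(self, path: str) -> bytes:
        # Mirror how stat_file derives the key, but issue a GET instead of a HEAD.
        key_name = f"{self.prefix}{path}"
        response = self.s3_client.get_object(Bucket=self.bucket, Key=key_name)
        return response["Body"].read()
```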

wvengen commented 4 weeks ago

It sounds like it makes sense to take the shortest route (in terms of effort needed) to get this working in a reasonable time. Eventually, this functionality would best be added to Scrapy itself, so that other extensions/middlewares can make use of it too; that makes the work useful either way.

Perhaps an order of things could be:

  1. Get clarity on what we'd need from an upstream change.
  2. Get clarity on whether stat_file and download_file (and whatever we end up with from the previous step) would be welcome in Scrapy, e.g. by opening an issue there to propose their inclusion (a rough interface sketch follows this list).
  3. If yes, and it is doable without a long review process and many technical intricacies, try to get it upstream. If no, document our findings in the upstream issue and solve the issue in this project. Ideally, this would still eventually move to Scrapy (unless upstream decides it is not part of its scope).
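
If we get to step 2, the upstream proposal could be as small as adding one method next to stat_file. A rough sketch of the interface we might propose (download_file is our suggested name, not existing Scrapy API):

```python
from abc import ABC, abstractmethod


class FilesStoreInterface(ABC):
    """Hypothetical shape of an upstream files store interface."""

    @abstractmethod
    def stat_file(self, path, info):
        """Existing: return metadata (checksum, last modified) for a file."""

    @abstractmethod
    def download_file(self, path):
        """Proposed addition: return the contents of a stored file."""
```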