Item exporters are designed to open a file, call start_exporting(), call export_item() however many times, then call finish_exporting(). In our case, we want one file per item. Item exporters aren't designed for that; it would be a huge overhead to have one item exporter per item.
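For reference, the exporter lifecycle looks roughly like this (a minimal sketch using JsonLinesItemExporter; `items` is just a placeholder for an iterable of scraped items):

```python
# Minimal sketch of the item exporter lifecycle: one exporter, one open
# file, many export_item() calls. `items` is a placeholder iterable.
from scrapy.exporters import JsonLinesItemExporter

with open('output.jl', 'wb') as f:
    exporter = JsonLinesItemExporter(f)
    exporter.start_exporting()
    for item in items:
        exporter.export_item(item)
    exporter.finish_exporting()
```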
As for files pipelines, those were used in the 'old pipeline' (#115, #51). Files pipelines are designed for the case where a spider gets a resource and wants to schedule the download of that resource's media (images, PDFs, etc.); by design, a resource might lead to further requests (e.g. a next page link in the resource), whereas media don't lead to further requests. However, in our case, the resource is also the media that we want to download and store (most of the time). In the 'old pipeline', this led to duplicate requests (once for the resource by the spider, and once for the media by the files pipeline). Although Scrapy can cache requests, in principle we shouldn't use the files pipeline, because we don't have a clean separation between resources and media. (NB: We still use the FILES_STORE setting, which is meaningful to the files pipeline, but for different purposes.)
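For contrast, here is a sketch of how the files pipeline is meant to be used (the selector and field values are illustrative, not our code): the spider yields an item that points at media URLs, and FilesPipeline then schedules separate requests to download them into FILES_STORE.

```python
# settings.py: enable the built-in files pipeline and tell it where to store media.
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = 'files'

# In a spider callback: the response is the "resource"; the PDFs it links
# to are the "media" that FilesPipeline downloads with separate requests.
def parse(self, response):
    yield {'file_urls': [response.urljoin(href) for href in
                         response.css('a[href$=".pdf"]::attr(href)').getall()]}
```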
Taking a step back: Right now, the spider writes files itself, which doesn't make sense in Scrapy's architecture. The spider is only expected to yield items, which then go through an item pipeline to be dropped, errored, or passed through, and which then send an item_scraped, item_dropped or item_error signal.
We should reserve the item pipeline for the types of operations for which it was intended, e.g. deduplication, filtering, cleaning, validating, etc. For example, we had discussed adding support for specific publishers' API features via spider arguments. In cases where a publisher lacks an API feature (e.g. filtering), we could instead perform the filtering in the item pipeline. This, of course, would only work if the spider isn't taking responsibility for writing files.
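For instance, a hypothetical pipeline could perform date filtering when a publisher's API can't; the `from_date` spider argument and the `date` item field below are assumptions for illustration, not existing names in our code.

```python
from scrapy.exceptions import DropItem


class DateFilterPipeline:
    """Drop items older than the spider's (hypothetical) from_date argument."""

    def process_item(self, item, spider):
        from_date = getattr(spider, 'from_date', None)
        if from_date and item.get('date') and item['date'] < from_date:
            raise DropItem(f'Skipping item dated {item["date"]}')
        return item
```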
The proposal is for the spider to stop writing files. To avoid rewriting all existing spiders in the short term, we can just change the relevant methods in the spider to return an item (as they do now) that includes the response/data that was being written.
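Concretely, a spider callback might yield something like the following instead of writing the file itself (the field names are illustrative, not a final schema):

```python
def parse(self, response):
    yield {
        'file_name': 'page-1.json',   # hypothetical naming scheme
        'url': response.request.url,
        'data': response.body,        # bytes to be written to disk later
        'data_type': 'release_package',
    }
```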
Then, whatever items make it through the item pipeline can be written to disk, by connecting the item_scraped signal to a new extension (which will be given higher priority in settings.py than the Kingfisher Process API extension #274). This new extension will basically do what the spider had been doing.
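A minimal sketch of such an extension, assuming items carry the illustrative `file_name` and `data` fields from above (the class name and layout are placeholders):

```python
import os

from scrapy import signals


class StoreResponseExtension:
    """Write each scraped item's data to disk, as the spiders do today."""

    def __init__(self, directory):
        self.directory = directory

    @classmethod
    def from_crawler(cls, crawler):
        extension = cls(crawler.settings['FILES_STORE'])
        crawler.signals.connect(extension.item_scraped, signal=signals.item_scraped)
        return extension

    def item_scraped(self, item, response, spider):
        path = os.path.join(self.directory, spider.name, item['file_name'])
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, 'wb') as f:
            f.write(item['data'])
```

It would be enabled via the EXTENSIONS setting in settings.py, alongside the Kingfisher Process API extension.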
The one thing to check is scrapyd's default behavior for items. Does it not write any output unless explicitly configured (I think so), or will it write the item stream somewhere? For our use case, we write the response data and send the item to the Kingfisher Process API; we don't need to write the item to disk.
The items are sent to standard output (the console) as log messages at the DEBUG level.
Note from call today: We need to check if/how we're configuring Scrapy/Scrapyd logging, since we don't want a log file that is as large as all the data we're downloading :)
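One option (an assumption on my part, not something we've decided) is a custom LogFormatter that suppresses the per-item "Scraped from" DEBUG messages; in recent Scrapy versions, returning None from a LogFormatter method skips that log entry.

```python
from scrapy.logformatter import LogFormatter


class QuietLogFormatter(LogFormatter):
    def scraped(self, item, response, spider):
        return None  # skip the per-item "Scraped from <response>" DEBUG message
```

We would then point the LOG_FORMATTER setting at this class in settings.py.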
Noting that I had disabled the items_dir setting in our deployment, so Scrapyd won't write the items anywhere besides the log: https://github.com/open-contracting/kingfisher-collect/issues/371#issuecomment-620226737
Original issue title: Explore Item Exporters instead of bespoke system for writing items
Original issue description: I haven't had time to look at this closely, but on the surface it seems to do the same thing. Ideally, we can substitute Item Exporters without causing any change in behavior.
https://docs.scrapy.org/en/latest/topics/exporters.html