open-contracting / kingfisher-collect

Downloads OCDS data and stores it on disk
https://kingfisher-collect.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Consider replacing handle_http_error with spider middleware #533

Open jpmckinney opened 3 years ago

jpmckinney commented 3 years ago

We set HTTPERROR_ALLOW_ALL = True. If we had left it at False (the default), HttpErrorMiddleware would have raised an HttpError exception, which subclasses IgnoreRequest – a special exception class that Scrapy ignores. That middleware also implements process_spider_exception to handle that exception and to log and count the HTTP errors.
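For reference, the relevant toggle is a single project setting; the comments below describe the two behaviours (this is an illustration, not our settings.py verbatim):

```python
# settings.py – current behaviour in kingfisher-collect: error responses are
# passed to spider callbacks, so each callback guards itself with
# @handle_http_error.
HTTPERROR_ALLOW_ALL = True

# Scrapy's default (False): HttpErrorMiddleware raises HttpError, an
# IgnoreRequest subclass, for non-2xx responses, then logs and counts them in
# its process_spider_exception.
# HTTPERROR_ALLOW_ALL = False
```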

Assuming we can write a new spider middleware that handles the HttpError exception first (before HttpErrorMiddleware's own process_spider_exception logs it and stops the chain), we can have it return FileError items instead. That way, we can remove all the @handle_http_error decorators, as in the sketch below.
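A minimal sketch of what that middleware could look like. The FileError class here is only a stand-in for kingfisher_scrapy.items.FileError, and its fields are assumptions:

```python
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError


class FileError(scrapy.Item):
    # Placeholder for kingfisher_scrapy.items.FileError; the real fields may differ.
    file_name = scrapy.Field()
    url = scrapy.Field()
    errors = scrapy.Field()


class HttpErrorItemMiddleware:
    """Converts the HttpError raised by HttpErrorMiddleware into a FileError item."""

    def process_spider_exception(self, response, exception, spider):
        if isinstance(exception, HttpError):
            spider.logger.warning('HTTP error %d for %s', response.status, response.url)
            # Returning an iterable of items ends the exception chain and sends
            # the item through the item pipelines.
            return [FileError(
                file_name=response.url,
                url=response.url,
                errors={'http_code': response.status},
            )]
        return None  # unrelated exceptions continue to the next middleware
```

The middleware would then be enabled in SPIDER_MIDDLEWARES; the exact order relative to HttpErrorMiddleware (which defaults to 50) would need checking so that it actually sees the exception first.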

Some spiders handle HTTP errors in special ways. For those spiders, the handle_httpstatus_list spider attribute can be set, as documented by HttpErrorMiddleware. They include spiders using:
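Whichever spiders those are, the per-spider opt-in is just a class attribute. A minimal sketch, with a hypothetical spider name and status codes:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'  # hypothetical
    start_urls = ['https://example.com/releases.json']
    # With HTTPERROR_ALLOW_ALL = False, only these non-2xx statuses reach the
    # callbacks; everything else is filtered out by HttpErrorMiddleware.
    handle_httpstatus_list = [404, 503]

    def parse(self, response):
        if response.status != 200:
            self.logger.error('HTTP error %d for %s', response.status, response.url)
            return
        yield {'url': response.url}
```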

jpmckinney commented 3 years ago

Since we want to use handle_http_error on some request callbacks but not all, I think it's simplest to leave it as a decorator. For example, the Paraguay spiders use handle_http_error for data requests, but manually handle errors for access token requests.
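For context, the decorator's job is roughly the following – a sketch of the pattern, not the actual implementation in kingfisher_scrapy.util:

```python
from functools import wraps


def handle_http_error(decorated_callback):
    """Run the callback only for successful responses; otherwise report the error."""
    @wraps(decorated_callback)
    def wrapper(self, response, **kwargs):
        if 200 <= response.status < 300:
            yield from decorated_callback(self, response, **kwargs)
        else:
            self.logger.error('HTTP error %d for %s', response.status, response.url)
            # the real decorator yields a FileError item built from the response
    return wrapper
```

A Paraguay-style spider then decorates its data-request callbacks with @handle_http_error and leaves the access token callback undecorated, so that callback can handle errors itself.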

jpmckinney commented 3 years ago

Actually, never mind - we can just use a request meta attribute to enable/disable the proposed middleware for cases like the Paraguay spiders.
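One way to do that with what Scrapy already provides: the handle_httpstatus_list / handle_httpstatus_all meta keys work per request and keep HttpErrorMiddleware from raising in the first place, so the proposed middleware is effectively disabled for that request and the callback sees the error response itself. A sketch (spider name, URL and callback are hypothetical; a custom meta flag checked in process_spider_exception would also work if we only want to suppress the FileError item):

```python
import scrapy


class ExampleParaguaySpider(scrapy.Spider):
    name = 'example_paraguay'  # hypothetical

    def start_requests(self):
        # With handle_httpstatus_all set, HttpErrorMiddleware never raises
        # HttpError for this request, so the proposed middleware stays out of
        # the way and parse_access_token handles the error response itself.
        yield scrapy.Request(
            'https://example.com/api/token',
            meta={'handle_httpstatus_all': True},
            callback=self.parse_access_token,
        )

    def parse_access_token(self, response):
        if response.status != 200:
            self.logger.error('Failed to get access token: HTTP %d', response.status)
            return
        # ... request data using the token ...
```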

jpmckinney commented 2 years ago

Another – maybe more appropriate – option is to use request errbacks with HTTPERROR_ALLOW_ALL = False (the default): https://docs.scrapy.org/en/latest/topics/request-response.html?highlight=exceptions#using-errbacks-to-catch-exceptions-in-request-processing
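Adapted from the linked documentation, that pattern would look roughly like this in a spider (spider name and URL are hypothetical; the FileError handling is only sketched in a comment):

```python
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError


class ExampleSpider(scrapy.Spider):
    name = 'example'  # hypothetical

    def start_requests(self):
        yield scrapy.Request('https://example.com/releases.json',
                             callback=self.parse, errback=self.errback)

    def parse(self, response):
        yield {'url': response.url}

    def errback(self, failure):
        # With HTTPERROR_ALLOW_ALL = False, a non-2xx response surfaces here
        # as an HttpError failure instead of reaching parse().
        if failure.check(HttpError):
            response = failure.value.response
            self.logger.error('HTTP error %d for %s', response.status, response.url)
            # kingfisher-collect could build its FileError item from the response here
        else:
            self.logger.error(repr(failure))
```

Like the decorator, this is opt-in per request, since only requests constructed with the errback are affected.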