Open jpmckinney opened 3 years ago
Since we want to use `handle_http_error` on some request callbacks but not all, I think it's simplest to leave it as a decorator. For example, the Paraguay spiders use `handle_http_error` for data requests, but manually handle errors for access token requests.
Actually, never mind: we can just use a request meta attribute to enable or disable the proposed middleware for cases like the Paraguay spiders.
Another – maybe more appropriate – option is to use request errbacks with `HTTPERROR_ALLOW_ALL = False` (the default): https://docs.scrapy.org/en/latest/topics/request-response.html?highlight=exceptions#using-errbacks-to-catch-exceptions-in-request-processing
We set `HTTPERROR_ALLOW_ALL = True`. If we had left it as `False`, `HttpErrorMiddleware` would have raised an `HttpError` exception, which subclasses `IgnoreRequest` – a special exception class that gets ignored by Scrapy. That middleware also implements `process_spider_exception` to handle that exception and to log and count the HTTP errors.

Assuming we can write a new spider middleware to handle the `HttpError` exception first, we can have it return `FileError` items instead. That way, we can remove all the `@handle_http_error` decorators.

Some spiders handle HTTP errors in special ways. For those spiders, the `handle_httpstatus_list` spider attribute can be set, as documented by `HttpErrorMiddleware`. They include spiders using:

- `is_http_success`
- `response.status`
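
A rough, untested-against-our-spiders sketch of what the proposed middleware might look like. The class name `HttpErrorToFileErrorMiddleware` is hypothetical, and a plain dict stands in for our `FileError` item, whose actual fields are not spelled out here. The `except ImportError` branch is only a stand-in so the sketch can be run without Scrapy installed; in the project, the real `HttpError` would be imported directly.

```python
try:
    from scrapy.spidermiddlewares.httperror import HttpError
except ImportError:
    # Stand-in with the same shape as Scrapy's HttpError, used only so this
    # sketch can run without Scrapy installed. Not part of the proposal.
    class HttpError(Exception):
        def __init__(self, response, *args, **kwargs):
            self.response = response
            super().__init__(*args, **kwargs)


class HttpErrorToFileErrorMiddleware:
    """Hypothetical spider middleware that turns HttpError into an error item."""

    def process_spider_exception(self, response, exception, spider):
        if not isinstance(exception, HttpError):
            # Returning None lets the remaining middlewares handle the exception.
            return None

        # Assumption: a dict stands in for the FileError item here.
        # Returning an iterable stops the exception from propagating further.
        return [
            {
                "url": response.url,
                "errors": {"http_code": response.status},
            }
        ]
```

For this to take effect, `HTTPERROR_ALLOW_ALL` would need to go back to `False`, and the middleware would need to be listed in `SPIDER_MIDDLEWARES` with an order that makes its `process_spider_exception` run before `HttpErrorMiddleware`'s (which sits at 50 in Scrapy's defaults). I believe that means a number above 50, but that should be checked against the Scrapy docs.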