**Open** · ilias-ant opened this issue 2 years ago
Hi @ilias-ant, thanks for raising this!

Would something like https://github.com/scrapinghub/scrapy-autoextract/pull/34 ease the issue? It introduces `AUTOEXTRACT_RESPONSE_ERROR_LOG_LEVEL` and `AUTOEXTRACT_ALLOWED_RESPONSE_ERRORS`, which give users finer control over the errors and logs. The logging default is still `logging.DEBUG`, so that existing users' logs are not unexpectedly filled up by our raising it to a higher level; it can nonetheless be overridden easily.
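For reference, a minimal sketch of how those two settings might be used in a project's `settings.py`. The setting names come from PR #34, but the accepted value formats shown here are assumptions, not a confirmed API:

```python
# settings.py -- sketch only; the value formats below are assumptions,
# only the setting names come from PR #34.
import logging

# Raise AutoExtract response-error records above the default
# logging.DEBUG so they survive typical production log filtering.
AUTOEXTRACT_RESPONSE_ERROR_LOG_LEVEL = logging.WARNING

# Errors the crawl expects and does not want logged at all
# (assuming the setting matches on error-message substrings).
AUTOEXTRACT_ALLOWED_RESPONSE_ERRORS = ["Downloader error: http404"]
```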
### Problem statement
A typical scenario when using the Scrapy middleware to auto-extract e.g. product page URLs is that said URLs may respond with a `404` status. However, the library does not provide a way to handle the associated `AutoExtractError`s. It seems that only successful requests (w.r.t. the domain crawled, not the AutoExtract API) are returned from the middleware, with the rest of them (non-successful) simply logged. The sketch below shows the kind of handling this rules out.
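To make the gap concrete, here is a hedged sketch of the spider-side handling a user might want to write. It does not work with the current middleware, and the `response.meta` keys shown are assumptions for illustration, not the library's confirmed API:

```python
# Sketch of the desired (currently impossible) handling; the
# "autoextract" meta keys below are hypothetical.
import scrapy


class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = [
        "https://www.dosfarma.com/30583-bossauto-mascarilla-ffp2-con-valvula-20.unidades.html",
    ]

    def parse(self, response):
        # Desired: errored pages still reach the callback, carrying the
        # AutoExtract error so the application can react to it.
        result = response.meta.get("autoextract", {})
        if "error" in result:  # e.g. "Downloader error: http404"
            self.logger.warning(
                "AutoExtract error for %s: %s", response.url, result["error"]
            )
            return
        yield result  # the successfully extracted data
```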
### Example

This is the output I get when I try to crawl the 404 URL: https://www.dosfarma.com/30583-bossauto-mascarilla-ffp2-con-valvula-20.unidades.html
With this information (a `DEBUG`-level log + an increase in the `autoextract/errors/result_error` metric), the user does not have access to the information contained in the unsuccessful responses, which may very well be important for many applications. Parsing the `DEBUG` logs seems a subpar practice, since deployed applications typically log statements with a level of `WARNING` and above.
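For context, this is the standard Scrapy setting that causes `DEBUG` records to be dropped in a typical deployed crawl, which is why `DEBUG`-only error reporting is easy to lose:

```python
# settings.py of a typical production crawl: anything below WARNING,
# including the middleware's DEBUG error records, is filtered out.
LOG_LEVEL = "WARNING"
```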
### Proposal

A refactoring of (at least) the `process_response` method of the `AutoExtractMiddleware`, in order to return a more unified response that covers all cases. For example, unsuccessful responses (w.r.t. the domain crawled, not the AutoExtract API) should contain the `Downloader error: http404`.
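A minimal sketch of what that unified behaviour could look like, assuming the middleware has the parsed AutoExtract payload as a dict. This mirrors the issue's description, not the project's actual code, and the names are hypothetical:

```python
# Hedged sketch, not the middleware's real process_response: always
# return a Response, attaching the AutoExtract payload (error included)
# so spider callbacks can inspect it.
from scrapy.http import TextResponse


def process_response_sketch(request, autoextract_result):
    # Expose the full API payload, e.g.
    # {"error": "Downloader error: http404", ...}, to callbacks.
    request.meta["autoextract"] = autoextract_result

    error = autoextract_result.get("error")
    if error and "http404" in error:
        # Mirror the remote downloader error in the response status so
        # normal Scrapy mechanisms (errbacks, HTTPERROR_ALLOWED_CODES)
        # can handle it.
        return TextResponse(url=request.url, status=404,
                            request=request, body=b"")
    return TextResponse(url=request.url, status=200,
                        request=request, body=b"")
```

With something like this, a `404` from the crawled domain would surface as an ordinary non-200 response in the spider instead of disappearing into a `DEBUG` log line.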