scrapinghub / scrapy-autoextract

Zyte Automatic Extraction integration for Scrapy
BSD 3-Clause "New" or "Revised" License

ability to handle AutoExtractError #33


ilias-ant commented 2 years ago

Problem statement

A typical scenario when using the Scrapy middleware to auto-extract e.g. product page URLs is that some of those URLs respond with a 404 status.

However, the library does not provide a way to handle the resulting AutoExtractErrors. It seems that only successful requests (w.r.t. the domain crawled, not the AutoExtract API) are returned from the middleware, while the unsuccessful ones are simply logged:

```python
...
if result.get('error'):
    self.inc_metric('autoextract/errors/result_error')
    self._log_debug_error(response, body)
    raise AutoExtractError('Received error from AutoExtract for {}: {}'.format(url, result["error"]))
...
```
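
For completeness, the closest thing to a workaround right now seems to be a request errback, since the raised AutoExtractError propagates there as a twisted Failure. A minimal sketch (the spider and callback names are hypothetical; middleware setup in settings.py is omitted):

```python
import scrapy
from scrapy_autoextract.middlewares import AutoExtractError


class ProductSpider(scrapy.Spider):
    # Hypothetical spider, for illustration only.
    name = 'products'

    def start_requests(self):
        url = ('https://www.dosfarma.com/30583-bossauto-mascarilla-ffp2'
               '-con-valvula-20.unidades.html')
        yield scrapy.Request(url, callback=self.parse_product,
                             errback=self.handle_error)

    def parse_product(self, response):
        # Successful case: the middleware exposes the result via response.meta.
        yield response.meta['autoextract']

    def handle_error(self, failure):
        if failure.check(AutoExtractError):
            # Only the flattened exception message survives here; the
            # structured AutoExtract result (query id, "Downloader error:
            # http404", etc.) is no longer accessible.
            self.logger.warning('AutoExtract failed: %s', failure.value)
```

Even this only yields the exception message, which is exactly the limitation described below.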

Example

This is the output I get when I try to crawl a URL that responds with 404: https://www.dosfarma.com/30583-bossauto-mascarilla-ffp2-con-valvula-20.unidades.html

```
2021-12-13 12:54:43 [scrapy_autoextract.middlewares] DEBUG: Process AutoExtract request for product URL <GET https://www.dosfarma.com/30583-bossauto-mascarilla-ffp2-con-valvula-20.unidades.html>
2021-12-13 12:54:55 [scrapy_autoextract.middlewares] DEBUG: AutoExtract response status=200  headers={'date': 'Mon, 13 Dec 2021 10:54:44 GMT', 'content-type': 'application/json', 'strict-transport-security': 'max-age=0; includeSubDomains; preload'}  content=[{"query":{"id":"1639392884013-e7d673376b493f68","domain":"dosfarma.com","userAgent":"scrapy-autoextract/0.5.2 scrapy/2.4.1","userQuery":{"url":"https://www.dosfarma.com/30583-bossauto-mascarilla-ffp2-con-valvula-20.unidades.html","pageType":"product"}},"error":"Downloader error: http404"}]
2021-12-13 12:54:55 [scrapy.core.scraper] ERROR: Error downloading <POST https://autoextract.scrapinghub.com/v1/extract>
Traceback (most recent call last):
  File "/home/iantonopoulos/.cache/pypoetry/virtualenvs/src-7UZI2jN5-py3.7/lib/python3.7/site-packages/twisted/internet/defer.py", line 1661, in _inlineCallbacks
    result = current_context.run(gen.send, result)
StopIteration: <200 https://autoextract.scrapinghub.com/v1/extract>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/iantonopoulos/.cache/pypoetry/virtualenvs/src-7UZI2jN5-py3.7/lib/python3.7/site-packages/twisted/internet/defer.py", line 1661, in _inlineCallbacks
    result = current_context.run(gen.send, result)
  File "/home/iantonopoulos/.cache/pypoetry/virtualenvs/src-7UZI2jN5-py3.7/lib/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 55, in process_response
    response = yield deferred_from_coro(method(request=request, response=response, spider=spider))
  File "/home/iantonopoulos/.cache/pypoetry/virtualenvs/src-7UZI2jN5-py3.7/lib/python3.7/site-packages/scrapy_autoextract/middlewares.py", line 190, in process_response
    '{}: {}'.format(url, result["error"]))
scrapy_autoextract.middlewares.AutoExtractError: Received error from AutoExtract for https://www.dosfarma.com/30583-bossauto-mascarilla-ffp2-con-valvula-20.unidades.html: Downloader error: http404
2021-12-13 13:00:21 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-13 13:00:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'autoextract/errors/result_error': 1,
 'autoextract/request_count': 1,
 'downloader/request_bytes': 460,
 'downloader/request_count': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 445,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 31.248149,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 12, 13, 11, 0, 21, 791393),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 9,
 'memusage/max': 70422528,
 'memusage/startup': 70422528,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2021, 12, 13, 10, 59, 50, 543244)}
```

With this information (a DEBUG-level log entry plus an increment of the autoextract/errors/result_error metric), the user has no access to the information contained in the unsuccessful responses, which may well be important for many applications. Parsing the DEBUG logs is a subpar practice, since deployed applications typically only emit log statements of level WARNING and above.

Proposal

A refactoring of (at least) the process_response method of the AutoExtractMiddleware, in order to return a more unified response that covers all cases. For example, unsuccessful responses (w.r.t. the domain crawled, not the AutoExtract API) should carry the Downloader error: http404 information.
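
To illustrate one possible shape, a hypothetical subclass of the current middleware (not part of the library; it assumes process_response is a plain synchronous method, and an async variant would need to await the parent call):

```python
from scrapy_autoextract.middlewares import (AutoExtractError,
                                            AutoExtractMiddleware)


class TolerantAutoExtractMiddleware(AutoExtractMiddleware):
    """Hypothetical middleware sketching the proposal: instead of letting
    AutoExtractError propagate, return the original response with the
    error attached, so spider callbacks receive a response in every case."""

    def process_response(self, request, response, spider):
        try:
            return super().process_response(request, response, spider)
        except AutoExtractError as exc:
            # Surface the AutoExtract error (e.g. "Downloader error:
            # http404") to the spider instead of burying it in a DEBUG log.
            request.meta['autoextract_error'] = str(exc)
            return response
```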

BurnzZ commented 2 years ago

Hi @ilias-ant, thanks for raising this!

Would something like https://github.com/scrapinghub/scrapy-autoextract/pull/34 ease the issue? It introduces AUTOEXTRACT_RESPONSE_ERROR_LOG_LEVEL and AUTOEXTRACT_ALLOWED_RESPONSE_ERRORS, which give users finer control over the errors and logs.

The logging default is still set to logging.DEBUG, though, so that existing users' logs aren't unexpectedly flooded by a higher default level. Nonetheless, it can easily be overridden.
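
For illustration, in settings.py this would look roughly like the following (the setting names come from the PR; the exact value semantics, e.g. matching allowed errors as full strings, are assumptions until it's merged):

```python
# settings.py -- sketch based on the settings proposed in PR #34.
import logging

# Log AutoExtract result errors at WARNING so they survive the typical
# production filter of WARNING-and-above (assumed to take a level constant).
AUTOEXTRACT_RESPONSE_ERROR_LOG_LEVEL = logging.WARNING

# Tolerate plain 404s instead of raising AutoExtractError (assumed to be
# a collection of error-message strings).
AUTOEXTRACT_ALLOWED_RESPONSE_ERRORS = ['Downloader error: http404']
```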