scrapinghub / scrapyrt

HTTP API for Scrapy spiders

Make the errback handling method configurable #156

Closed: radostyle closed this 5 months ago

radostyle commented 6 months ago

This patch makes the errback method that is called on the spider configurable.

Setting it to None in the configuration lets the user return Scrapy to its default exception handling.
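For illustration, a minimal sketch of what such a configuration could look like; the setting name below is an assumption for this sketch, not necessarily what the patch uses:

```python
# scrapyrt settings sketch. DEFAULT_ERRBACK is a hypothetical setting
# name used for illustration; check the docs added by this patch for
# the real one.
DEFAULT_ERRBACK = None  # None = fall back to Scrapy's own exception handling
```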

Currently, scrapyrt has an undocumented bug: exceptions are sent to the spider's 'parse' method.

Here is the commit where this was added: https://github.com/scrapinghub/scrapyrt/commit/75d2b3e9dc09a19db6f8dd159b210dd27b94214d

The errback should never have defaulted to the spider's 'parse' method. That default contradicts what the Scrapy docs say, and nothing in the Scrapy documentation says that exceptions get sent to the parse method. The issue was found because the error handling in a process_spider_exception middleware was never being called, even though the Scrapy docs say it should be.

The author was adding a feature so that one could pass the errback as a GET parameter, as was already the case for "callback". It seems they copy-pasted the "callback" line to get it working, not realizing that 'parse' was a bad default for the errback.

The correct default is to allow the exception to propagate to the existing Scrapy functionality.
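To make the difference concrete, here is a simplified sketch of both defaults; the spider and URLs are made up for illustration and this is not scrapyrt's exact code:

```python
import scrapy


class DemoSpider(scrapy.Spider):
    name = "demo"

    def start_requests(self):
        # What scrapyrt effectively did (simplified): default both the
        # callback and the errback to the spider's parse method.
        yield scrapy.Request(
            "https://example.com",
            callback=self.parse,
            errback=self.parse,  # buggy default: failures land in parse()
        )
        # The correct default: errback=None (or omit it) so that failures
        # propagate to Scrapy's normal exception handling.
        yield scrapy.Request(
            "https://example.org",
            callback=self.parse,
            errback=None,
        )

    def parse(self, response):
        # With the buggy default, this is also invoked with a twisted
        # Failure object instead of a Response when a request fails.
        yield {"url": response.url}
```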

For backwards compatibility with anyone who relies on exceptions being sent to their 'parse' method, this patch keeps the bug but adds some documentation, and it allows users who want the unmodified Scrapy exception handling to get it back.

radostyle commented 5 months ago

Looks good to me, and I agree that parse should never have been the default, but I don't see what this has to do with process_spider_exception. That hook catches exceptions raised from a callback or from an earlier spider middleware, while errback handles exceptions from the download handler and downloader middlewares, as far as I recall.
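For reference, a minimal sketch of plain-Scrapy errback usage (spider name and URL are illustrative):

```python
import scrapy


class ErrbackDemoSpider(scrapy.Spider):
    name = "errback_demo"

    def start_requests(self):
        yield scrapy.Request(
            "https://example.com/might-fail",
            callback=self.parse,
            # The errback receives failures from the download handler and
            # downloader middlewares (DNS errors, timeouts, ...), not
            # exceptions raised inside parse() itself.
            errback=self.handle_error,
        )

    def parse(self, response):
        yield {"status": response.status}

    def handle_error(self, failure):
        # failure is a twisted.python.failure.Failure
        self.logger.error("Request failed: %r", failure)
```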

If your goal is to catch download exceptions, you might want to look into process_exception from downloader middlewares instead of process_spider_exception from spider middlewares.
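A minimal sketch of that approach, with an assumed middleware name:

```python
class DownloadErrorMiddleware:
    # Hypothetical name; enable it via the DOWNLOADER_MIDDLEWARES setting.

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a downloader middleware's
        # process_request() raises an exception.
        spider.logger.error("Download of %s failed: %r", request.url, exception)
        # Return None to continue the process_exception chain, a Response
        # to treat the download as successful, or a Request to reschedule.
        return None
```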

Here is a cleanroom project where you can test it out: https://gitlab.com/jasource/scrapyrtexception. An exception thrown in the middleware's process_spider_input ends up in the parse method instead of in process_spider_exception; when not running under scrapyrt, it works as expected.
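The behaviour described can be reproduced with a spider middleware along these lines (a sketch with assumed names, not the linked project's exact code):

```python
class BoomSpiderMiddleware:
    # Assumed name; enable it via the SPIDER_MIDDLEWARES setting.

    def process_spider_input(self, response, spider):
        # Per the Scrapy docs, an exception raised here is routed to the
        # request's errback if one is set, and only otherwise starts the
        # process_spider_exception() chain. Because scrapyrt silently set
        # the errback to the spider's parse(), the exception ended up in
        # parse() and process_spider_exception() below was never called.
        raise ValueError("boom")

    def process_spider_exception(self, response, exception, spider):
        spider.logger.error("Caught %r for %s", exception, response.url)
        return []  # returning an iterable stops the exception here
```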

radostyle commented 5 months ago

@pawelmhm I added some code to report user errors that occur in the spider_idle method back to the API request, added a unit test, and modified the existing unit tests. I'm also not sure whether we should keep backward compatibility for behavior that seems to be an error, but I'll let you make that call (we can just change the default to None instead of 'parse' in that case). Let me know what you think.
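For context, spider_idle is a Scrapy signal that a spider can hook like this (an illustrative sketch; the error-reporting plumbing from the patch lives in scrapyrt itself):

```python
import scrapy
from scrapy import signals


class IdleDemoSpider(scrapy.Spider):
    name = "idle_demo"
    start_urls = ["https://example.com"]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    def spider_idle(self):
        # Fired when the spider has no more pending requests. If user
        # code here raises, Scrapy only catches and logs the error; the
        # patch additionally reports it in the scrapyrt API response.
        raise RuntimeError("simulated user error in spider_idle")

    def parse(self, response):
        yield {"url": response.url}
```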

pawelmhm commented 5 months ago

Thanks @radostyle, let's go with a default errback of None. I'll check and release this today.

pawelmhm commented 5 months ago

I made changes to default to None here, added some more docs, and cleaned up the error message a bit so that it doesn't return the full traceback and instead logs the info to a file. https://github.com/scrapinghub/scrapyrt/pull/158