Closed radostyle closed 5 months ago
Looks good to me, and I agree that
parse
should never have been the default, but I don’t see what this has to do withprocess_spider_exception
. That catches exceptions from a callback or an earlier spider middleware, while errback handles exceptions from the download handler and downloader middlewares, as far as I recall.If your goal is to catch download exceptions, you might want to look into
process_exception
from downloader middlewares instead ofprocess_spider_exception
from spider middlewares.
Here is a cleanroom project where you can test it out. Exception is thrown in the middleware process_spider_input
and ends up in the parse
method instead of in process_spider_exception
. But when not running under scrapyrt it works as expected.
https://gitlab.com/jasource/scrapyrtexception
@pawelmhm I added some code to report user errors that occur in the spider_idle method to the api request, added a unit test, and modified existing unit tests. I also am not sure if we should keep backward compatibility for behavior which seems to be an error, but I'll let you make that call (we can just change the default to None instead of 'parse' in that case) Let me know what you think.
thanks @radostyle let's go with default errback None, I'll check and release this today.
I made changes to default to None here, added some more docs and also cleaned up error message a bit so that it doesn't return full traceback and logs info to file. https://github.com/scrapinghub/scrapyrt/pull/158
This patch makes the errback method that is called on the spider configurable.
By setting it to None in the configuration it allows the user to return scrapy to it's default exception handling.
Currently, in scrapyrt, there is the undocumented bug that exceptions are sent the 'parse' method.
Here is the commit where this was added: https://github.com/scrapinghub/scrapyrt/commit/75d2b3e9dc09a19db6f8dd159b210dd27b94214d
The errback should never have been defaulted to the 'parse' method of the spider. By doing this it invalidates what the scrapy docs say. Also, there is no documentation on the scrapy site that says that exceptions get sent to the parse method. The reason this was found is because the error handling in the
process_spider_exception
middleware was never getting called as the scrapy docs said it should be.The author was adding a feature so that one could pass the errback as a GET parameter as in the case for the "callback". It seems they copy-pasted the "callback" line to get it to work, not realizing that 'parse' was a bad default for errback.
The correct default is to allow the exception to propagate to the existing scrapy functionality.
For backwards compatibility for anyone who relies on now sending exceptions to their 'parse' method, this patch keeps the bug, but adds some documentation, and allows users who want the unmodified scrapy exception handling to get it back.