Open oscarrobertson opened 4 years ago
Another option would be to add the failure counts per request type to the stats; this is more flexible than emitting WARNINGs and ERRORs and then counting them.
Yes, adding to stats would definitely be part of it. If I were to write a custom Scraper class I would include that. Are you suggesting there be no "threshold" settings, and that the user should periodically check the stats to see whether they have exceeded acceptable levels? I'm not against a solution like this, but there would need to be some way to turn off the default logging in Scraper: https://github.com/scrapy/scrapy/blob/c911e802097ecd3309bb826d48b7b08ce108f4ce/scrapy/core/scraper.py#L199
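A minimal sketch of the stats-plus-threshold idea, independent of Scrapy itself: failures are counted per request type in a stats-style dict, and logging escalates from WARNING to ERROR once a per-type threshold is crossed. The class and method names (`FailureTracker`, `record_failure`, `thresholds`) are invented for illustration and are not part of Scrapy's API.

```python
import logging

logger = logging.getLogger(__name__)

class FailureTracker:
    """Hypothetical sketch: count download failures per request type and
    escalate from WARNING to ERROR once a per-type threshold is exceeded."""

    def __init__(self, thresholds):
        # e.g. {"listing": 100} -> the first 100 "listing" failures log
        # as WARNING; any further failures log as ERROR.
        self.thresholds = thresholds
        self.stats = {}

    def record_failure(self, request_type):
        key = f"download_failure_count/{request_type}"
        self.stats[key] = self.stats.get(key, 0) + 1
        limit = self.thresholds.get(request_type, 0)
        level = logging.WARNING if self.stats[key] <= limit else logging.ERROR
        logger.log(level, "Download failure for %s request (%d so far)",
                   request_type, self.stats[key])
        return level
```

In a real integration these counts would live in Scrapy's `crawler.stats`, so tools that already read the stats (rather than log files) could monitor them directly.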
the user should somehow periodically check the stats
Please take a look at https://github.com/scrapinghub/spidermon :)
Summary
It would be nice to have a concept of request importance that affects the logging level of download failures, i.e., a way to yield a request while indicating that, if it fails to download, only a warning should be logged. A feature that keeps track of how many low-importance requests have failed so far would also be useful, since there could be a threshold on the number of failures beyond which error logging kicks back in.
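One way the per-request part could look, as a hedged sketch: tag each request with an importance level and map that to a failure log level. The `"importance"` meta key and the mapping below are assumptions for illustration; Scrapy has no such setting today, though `Request.meta` is the natural place to carry it.

```python
import logging

# Hypothetical mapping from an "importance" tag on a request to the log
# level used when its download fails. "importance" is an invented meta
# key, not something Scrapy currently supports.
FAILURE_LOG_LEVELS = {
    "high": logging.ERROR,
    "low": logging.WARNING,
}

def failure_log_level(request_meta):
    # Default to ERROR, matching Scrapy's current behaviour of logging
    # every download failure as an error.
    return FAILURE_LOG_LEVELS.get(request_meta.get("importance"), logging.ERROR)
```

A spider would then yield something like `Request(url, meta={"importance": "low"})`, and the failure-logging code would call `failure_log_level(request.meta)` instead of hard-coding `logging.ERROR`.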
Motivation
We use Scrapy extensively, and we track log files to help us monitor scraping processes. Imagine a scrape that yields one request (R1) to start, and then in the first callback yields 10,000 requests (R2s). From a monitoring perspective, if R1 fails that is a huge problem, but if one of the R2s fails I don't really care. If many of the R2s fail, though, that is a big deal.
If I see an error in the log files for code I did not write, it's not immediately clear whether the failure is an R1 or an R2 request; I have to go and read the code to find out.
Additional context
It looks like all we need to do is override the Scraper class slightly, and maybe write new Request types. It's currently hard to plug in a custom Scraper class; this line could be changed to pull a class from settings, like the settings above it: https://github.com/scrapy/scrapy/blob/c911e802097ecd3309bb826d48b7b08ce108f4ce/scrapy/core/engine.py#L70 I thought I would reach out to see whether this behavior might be useful to others as part of the default Scraper; we're happy to contribute a PR if so.
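The "pull a class from settings" change could follow the same pattern Scrapy uses elsewhere with `scrapy.utils.misc.load_object`. Below is a minimal stand-in for `load_object` plus a comment showing the idea; the `SCRAPER_CLASS` setting name is hypothetical, invented here for illustration.

```python
from importlib import import_module

def load_object(path):
    """Minimal stand-in for scrapy.utils.misc.load_object: import an
    object from a dotted path such as 'scrapy.core.scraper.Scraper'."""
    module_path, _, name = path.rpartition(".")
    return getattr(import_module(module_path), name)

# In the engine, the hard-coded `Scraper(crawler)` could instead come from
# a (hypothetical) SCRAPER_CLASS setting, defaulting to the current class:
#
#   scraper_cls = load_object(
#       settings.get("SCRAPER_CLASS", "scrapy.core.scraper.Scraper"))
#   scraper = scraper_cls(crawler)
```

That keeps the default behavior identical while letting users swap in a Scraper subclass with importance-aware failure logging.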