scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Importance of Requests #4156

Open oscarrobertson opened 4 years ago

oscarrobertson commented 4 years ago

Summary

It would be nice to have a concept of request importance that affects the logging level of download failures, i.e. a way to yield a request and declare that a download failure for it should only be logged as a warning. A way to keep track of how many low-importance requests have failed so far would also be useful, so that error logging could kick back in once the number of failures crosses a threshold.
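To make the idea concrete, here is a hypothetical sketch of what the spider side could look like. The `importance` meta key and the `LOW_IMPORTANCE_ERROR_THRESHOLD` setting are invented for illustration and are not existing Scrapy features:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com/index"]

    # Hypothetical setting: once this many low-importance requests have
    # failed, failures would be logged as ERROR again instead of WARNING.
    custom_settings = {"LOW_IMPORTANCE_ERROR_THRESHOLD": 100}

    def parse(self, response):
        # R1 succeeded; the many follow-up requests (R2s) are individually
        # unimportant, so mark them as low importance.
        for href in response.css("a.item::attr(href)").getall():
            yield response.follow(
                href,
                callback=self.parse_item,
                meta={"importance": "low"},  # hypothetical convention
            )

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```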

Motivation

We use Scrapy extensively and we track log files to help us monitor scraping processes. Imagine a scrape that starts by yielding one request (R1), and then in the first callback yields 10,000 requests (R2s). From a monitoring perspective, if R1 fails that is a huge problem, but if a single R2 fails I don't really care. If lots of the R2s fail, though, that is a big deal.

If I see an error in the log files for code I did not write, it's not immediately clear whether the failure was an R1 or an R2 request; I have to go and read the code to find out.

Additional context

It looks like all we need to do is override the Scraper class slightly, and maybe write new Request types. It's kind of hard to plug in a custom Scraper class at the moment; this line could be changed to pull the class from a setting, like the lines just above it do: https://github.com/scrapy/scrapy/blob/c911e802097ecd3309bb826d48b7b08ce108f4ce/scrapy/core/engine.py#L70. I thought I would reach out to see whether this behavior is something that might be useful to others as part of the default Scraper; we're happy to contribute a PR if so.
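For context, the lines just above the one linked already resolve the downloader and scheduler classes from settings via load_object, so the suggested change would presumably look something like the sketch below. The SCRAPER setting name is an assumption rather than an existing Scrapy setting, and the class body is heavily abridged:

```python
from scrapy.utils.misc import load_object


class ExecutionEngine:
    """Abridged sketch of scrapy/core/engine.py, not the real class."""

    def __init__(self, crawler, spider_closed_callback):
        self.crawler = crawler
        self.settings = crawler.settings
        # Existing pattern: scheduler and downloader classes come from settings.
        self.scheduler_cls = load_object(self.settings["SCHEDULER"])
        downloader_cls = load_object(self.settings["DOWNLOADER"])
        self.downloader = downloader_cls(crawler)
        # Suggested change: resolve the scraper class the same way instead of
        # hard-coding Scraper(crawler). "SCRAPER" is a hypothetical setting.
        scraper_cls = load_object(
            self.settings.get("SCRAPER", "scrapy.core.scraper.Scraper")
        )
        self.scraper = scraper_cls(crawler)
        self._spider_closed_callback = spider_closed_callback
```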

wRAR commented 4 years ago

Another option would be to somehow add the per-request-type failure counts to the stats; this is more flexible than emitting and then counting WARNINGs and ERRORs.
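As a rough sketch of how that can already be approximated without core changes, a small downloader middleware could increment a stats counter keyed by a request type that the spider sets itself. The request_type meta key and the stat name below are just conventions assumed for the example:

```python
class RequestTypeFailureStatsMiddleware:
    """Downloader middleware sketch: count download failures per request type.

    Spiders would tag requests with meta={"request_type": "..."} (an assumed
    convention, not a built-in Scrapy feature).
    """

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

    def process_exception(self, request, exception, spider):
        request_type = request.meta.get("request_type", "default")
        self.stats.inc_value(f"download_failure_count/{request_type}", spider=spider)
        # Return None so normal exception handling (retries, errbacks) continues.
        return None
```

It would be enabled through the DOWNLOADER_MIDDLEWARES setting like any other downloader middleware.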

oscarrobertson commented 4 years ago

Yes, adding to stats would definitely be part of it; if I were to write a custom Scraper class I would include that. Are you suggesting there be no "threshold" settings and the user should somehow periodically check the stats to see whether they have exceeded acceptable levels? I'm not against a solution like that, but there would need to be some way to turn off the default error logging in Scraper: https://github.com/scrapy/scrapy/blob/c911e802097ecd3309bb826d48b7b08ce108f4ce/scrapy/core/scraper.py#L199
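As an aside, in newer Scrapy versions where LogFormatter exposes a download_error method (which postdates this discussion), the level of that message can be adjusted without touching Scraper at all. A minimal sketch, assuming requests are marked with an importance meta key:

```python
import logging

from scrapy.logformatter import LogFormatter


class ImportanceAwareLogFormatter(LogFormatter):
    """Sketch: demote download errors for low-importance requests to WARNING."""

    def download_error(self, failure, request, spider, errmsg=None):
        entry = super().download_error(failure, request, spider, errmsg)
        if request.meta.get("importance") == "low":
            entry["level"] = logging.WARNING
        return entry
```

It would be activated by pointing the LOG_FORMATTER setting at the class.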

wRAR commented 4 years ago

the user should somehow periodically check the stats

Please take a look at https://github.com/scrapinghub/spidermon :)
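For completeness, a minimal Spidermon monitor along these lines might look roughly like the sketch below. It assumes the hypothetical download_failure_count/low stat from the earlier middleware sketch, an arbitrary threshold of 100, and the usual Spidermon wiring (SPIDERMON_ENABLED plus SPIDERMON_SPIDER_CLOSE_MONITORS in settings):

```python
from spidermon import Monitor, MonitorSuite, monitors


@monitors.name("Low-importance request failures")
class LowImportanceFailuresMonitor(Monitor):
    @monitors.name("Failure count below threshold")
    def test_failure_count_below_threshold(self):
        # "download_failure_count/low" is the hypothetical stat from the
        # middleware sketch above; 100 is an arbitrary example threshold.
        failures = self.data.stats.get("download_failure_count/low", 0)
        self.assertLessEqual(
            failures, 100, msg="Too many low-importance requests failed"
        )


class SpiderCloseMonitorSuite(MonitorSuite):
    monitors = [LowImportanceFailuresMonitor]
```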