istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

Stats keys flooding redis + Crawl resilience + retry strategy #255

Closed benjaminelkrieff closed 3 years ago

benjaminelkrieff commented 3 years ago

Hello Madison, and thanks again for providing support to the community.

As the title indicates I have 3 things I want to clarify regarding the system:

1) We use Kubernetes as our cloud platform, and our testing spawned many pods, i.e. many different hosts. Looking at the keys in Redis now, we have more than 2000 of them, 99% being keys that start with stats:crawler:host-name:spider:200:lifetime, etc.

2) Crawl resilience: when our system crashes in the middle of a crawl, is that crawl lost? Or, as we would expect, is it retried once the system is back up?

3) How could we implement a smart retry strategy like the one in the Sidekiq framework for Rails, i.e. a strategy that delays retries according to certain factors? Our current strategy is to resend a Scrapy request when an exception is caught. How could we make it smarter and add some custom delays? And would we be able to see retried or failed crawls in the ELK logs?

Thank you for your time

madisonb commented 3 years ago
  1. I recommend disabling the stats collection by default - it is only helpful if you want the system to be independent of other logging/aggregation frameworks that surface statistics about your system. See here for a further explanation. Deleting those keys will not hurt your system - feel free to disable stats collection across the board.

  2. Since Scrapy Cluster is built on top of Scrapy, you get the resilience Scrapy provides. I recommend using custom middlewares to catch downloader issues or, as you said, some try/except blocks to catch parsing errors in your spider. I am not sure what type of errors are happening, but between the Scrapy logs and the Scrapy Cluster logs you should be able to find the issue and correct it. Re-yielding requests ensures they persist through other crashes.

  3. The Scrapy scheduler is helpful for this use case. I am not familiar with Sidekiq, but you can control how often Scrapy checks for new crawls, as well as the priority at which the crawl is put into the backlog to be collected. By default the spiders fetch the highest-priority crawls first, so depending on your throttle settings you can control how fast the spider crawls and the order in which sites are crawled. You can also yield across spiders, so you can have different settings for errors or problem sites via a totally different spider. And yes, if your ELK stack is set up correctly you can see the 504s, 200s, 404s, etc. and what the spiders are doing.
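For point 1, cleaning out the existing stats keys can be done with a batched SCAN/DEL pass. This is a hedged sketch written against the redis-py client interface; the key pattern matches the stats:crawler:* keys described above, and the batch size is an arbitrary choice.

```python
def delete_stats_keys(client, pattern="stats:crawler:*", batch=500):
    """Delete Redis keys matching `pattern` in batches via a pipeline.

    `client` is assumed to be a redis-py style client exposing
    scan_iter() and pipeline(). Returns the number of keys deleted.
    """
    deleted = 0
    pipe = client.pipeline()
    for key in client.scan_iter(match=pattern, count=batch):
        pipe.delete(key)
        deleted += 1
        if deleted % batch == 0:
            pipe.execute()  # flush a full batch of deletes
    pipe.execute()  # flush the remainder
    return deleted

# Usage sketch (assumes a local Redis and redis-py installed):
#   import redis
#   client = redis.Redis(host="localhost", port=6379)
#   print(delete_stats_keys(client))
```

Using SCAN instead of KEYS keeps the pass from blocking Redis on a large keyspace.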
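For point 2, the custom-middleware approach can be sketched as a Scrapy downloader middleware whose process_exception hook returns a fresh copy of the failed request, which makes Scrapy re-schedule it. The retry cap and the reyield_times meta key are my own placeholders, not part of Scrapy or Scrapy Cluster.

```python
import logging

logger = logging.getLogger(__name__)

class ReyieldOnErrorMiddleware:
    """Downloader middleware sketch: re-queue a request when the
    download raises an exception, up to a retry cap."""

    MAX_RETRIES = 3  # assumption: tune for your cluster

    def process_exception(self, request, exception, spider):
        retries = request.meta.get("reyield_times", 0)
        if retries >= self.MAX_RETRIES:
            logger.error("giving up on %s after %d tries: %r",
                         request.url, retries, exception)
            return None  # fall through to Scrapy's normal error handling
        # dont_filter=True bypasses the dupe filter so the retry is not dropped
        retry_req = request.replace(dont_filter=True)
        retry_req.meta["reyield_times"] = retries + 1
        logger.warning("re-yielding %s (attempt %d): %r",
                       request.url, retries + 1, exception)
        return retry_req  # returning a Request re-schedules it
```

Returning a Request from process_exception is standard Scrapy middleware behavior; the same counting pattern also works in a try/except inside the spider callback.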
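For point 3, the priority is set per crawl request when it is fed into the cluster. As a rough sketch of what such a request looks like as JSON, with url, appid, and crawlid required and priority optional; treat the exact field set as an assumption and check the crawl API docs for your version:

```python
import json

# Hypothetical crawl request; higher priority values are popped
# from the backlog first. appid/crawlid are placeholders.
request = {
    "url": "http://example.com",
    "appid": "testapp",   # placeholder application id
    "crawlid": "abc123",  # placeholder crawl id
    "priority": 90,       # optional; bump for urgent crawls
}
print(json.dumps(request))
```

The resulting JSON string is what gets fed to the kafka-monitor for scheduling.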

benjaminelkrieff commented 3 years ago

Thank you for your answer! About 2): I have a spider with pretty long crawls, and if the system crashes before the response is yielded, I'm afraid that crawl is dropped and never retried. Ideally, I would want a mechanism that executes crawls found in Redis and removes a crawl from Redis only once it has been fully processed. That way, when the system crashes and is re-spawned, the crawl is retried because its key is still in Redis, at a pace dictated by the priority policy you described in 3). Could you tell me what behaviour Scrapy Cluster adopts regarding fault tolerance? Does it provide such a feature, or do I need to modify things myself?

About 3): do you think decreasing the priority on each retry (by retry I mean yielding a new request that calls parse again after an exception is caught) is a good regulating option?
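That retry-with-decayed-priority idea could be sketched as follows, where each failure re-yields the request at a lower priority so fresh crawls outrank repeat failures. The step size, floor, and the meta handling are assumptions to verify against your scheduler setup, not Scrapy Cluster API.

```python
def decayed_priority(current, step=10, floor=0):
    """Lower the priority by `step`, never dropping below `floor`.

    step/floor are arbitrary assumptions; tune them so that a crawl
    that keeps failing sinks to the back of the priority queue.
    """
    return max(current - step, floor)

# Usage sketch inside a Scrapy-style callback (commented, since the
# exact meta/priority plumbing depends on your scheduler):
# def parse(self, response):
#     try:
#         ...  # extract items
#     except Exception:
#         prio = decayed_priority(response.request.meta.get("priority", 0))
#         retry = response.request.replace(dont_filter=True, priority=prio)
#         retry.meta["priority"] = prio  # assumption: scheduler reads this
#         yield retry
```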

madisonb commented 3 years ago

In-flight requests that "crash" mid-flight are not automatically retried. Normally something triggers the failure, like a timeout or an unexpected response, and as I said before you can add a middleware to catch those and retry the request (like the one that comes by default with this project). There are only so many places you can put a middleware in the Scrapy spider/downloader request stack; alternatively, you can add custom logic that catches an exception raised in the spider and then re-yields the request. Either way, you need to re-yield that in-flight request.

For your second question, decreasing the priority on each retry is exactly what I would suggest, and it is standard practice at my organization as well. Remember that the queue Scrapy Cluster reads from is a priority queue, so feel free to tune as needed.

To be clear - there is no special handling of in-flight requests in process by Scrapy, but if Scrapy exposes that kind of variable I would consider a PR that stores it temporarily in Redis (and removes it) when it is complete. Perhaps that could be done with a middleware that stores and cleans up the in-flight request. You would also want to write a small redis-monitor plugin that scans for those crashed requests and, if their lifetime exceeds your threshold, adds them back to the normal queue.
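The sweep half of that hypothetical plugin might look like this. Everything here is an assumption for illustration: the inflight: key prefix, the stored JSON shape, and the 15-minute threshold are invented, not part of Scrapy Cluster.

```python
import json
import time

INFLIGHT_PREFIX = "inflight:"    # hypothetical key prefix for stored requests
THRESHOLD_SECONDS = 15 * 60      # assumption: 15 minute in-flight lifetime

def find_stale(entries, now=None, threshold=THRESHOLD_SECONDS):
    """Return (key, request) pairs whose stored timestamp is too old.

    `entries` maps a redis key to a JSON blob shaped like
    {"request": {...}, "ts": <unix seconds>} - an invented layout that a
    storing middleware would have to write.
    """
    now = time.time() if now is None else now
    stale = []
    for key, blob in entries.items():
        record = json.loads(blob)
        if now - record["ts"] > threshold:
            stale.append((key, record["request"]))
    return stale

# A real plugin would then delete each stale key and push its request
# back onto the spider's priority queue for a normal retry.
```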

For clusters that process millions of requests per day or per hour this might not be a viable solution, but you can modify the project to your heart's content. I am an advocate for never letting mysterious "crashes" plague a system; there is always a reason, and a solution can be implemented to keep the crawl from "crashing".

benjaminelkrieff commented 3 years ago

Thank you for your answer! Sorry for the delay.

madisonb commented 3 years ago

If this satisfies your question, please close the ticket 👍