istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

Redis queue takes a lot of memory. #123

Closed · arheys closed this 7 years ago

arheys commented 7 years ago

Hello, I'm facing a problem: I'm using a Redis (AWS ElastiCache) instance with 27 GB of memory, and after about 1 million pages parsed Redis runs out of memory. How can I control the queue, for example by setting a TTL on records? My settings.py: ... DUPEFILTER_TIMEOUT = 60 ... SCHEDULER_QUEUE_TIMEOUT = 60

Suggest a solution, please.
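(For reference, this is roughly how those two timeouts sit in the crawler's settings.py; the comments describe what I understand each one to control:)

```python
# crawler settings.py -- the two expiration values mentioned above
DUPEFILTER_TIMEOUT = 60        # seconds before a crawl's duplicate-filter keys expire in Redis
SCHEDULER_QUEUE_TIMEOUT = 60   # seconds before stagnant per-domain queue keys are cleaned up
```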

madisonb commented 7 years ago

@arheys It appears to me that you are ingesting crawl requests faster than your cluster can process them. The flow is normally

Requests In → Redis Queue → Requests Scraped

where in your case, In > Out. You either need to scale up the number of spiders and machines in your cluster, decrease the speed at which you are generating requests, or adjust the domain throttle levels on your spiders so they crawl at a speed that is reasonable for you.

In normal operation your Redis Queue backlog should hover near 0, as you have enough spiders to cover how fast your requests come in. If your crawl backlog continues to grow, I would suggest you adjust your settings appropriately.
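As a rough sketch, the domain throttle knobs live in the crawler's localsettings.py and look something like this (the values here are illustrative only, not a recommendation for your cluster):

```python
# crawler localsettings.py -- illustrative throttle values, tune for your own domains
QUEUE_HITS = 10         # max requests allowed per domain per throttle window
QUEUE_WINDOW = 60       # throttle window length, in seconds
QUEUE_MODERATED = True  # spread hits evenly across the window instead of bursting
```

Raising the hit rate (or adding more spider machines) drains the queue faster, while slowing down whatever feeds requests in shrinks the backlog from the other side.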

You might also want to check out the Production Setup documentation for further thoughts on deployment strategies.
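If you want to watch the backlog directly, something like the following works, assuming the default key layout of one sorted set per spider/domain pair named `<spiderid>:<domain>:queue` (adjust the match pattern if your keys differ):

```python
import redis

# Count pending requests across all spider/domain queues.
# Assumes the default "<spiderid>:<domain>:queue" sorted-set naming.
r = redis.StrictRedis(host='your-elasticache-endpoint', port=6379)

total = 0
for key in r.scan_iter(match='*:queue'):
    backlog = r.zcard(key)   # each queue is a sorted set of serialized requests
    total += backlog
    print(key, backlog)

print('total backlog:', total)
```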

If this answers your question please close this ticket.

arheys commented 7 years ago

Thanks for the reply. I've run some experiments, but the Redis queue is still growing very fast. I'm using the default wandering spider with slight modifications, and Redis as the AWS ElastiCache service. Here are my settings and spider: wandering_spider.txt settings.txt. I'm scraping all links from about 10 domains, and after roughly 1.2-1.3 million pages scraped, Redis runs out of memory. I ran 72 spiders simultaneously from 9 instances, each with its own IP. Thanks in advance for the answer.

madisonb commented 7 years ago

The wandering spider is just an example; are you sure you want to use it in production? If you want an on demand scraping cluster, you should be using the link spider, not the wandering spider.

Since the wandering spider runs indefinitely until it can't find any new links, any new requests that come into the cluster continue to generate more links, and may not ever go away. I think the behavior you are seeing makes sense given the wandering spider's implementation.

If I understand what you are trying to do, you should use the link spider instead.
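For example, a bounded link spider crawl is normally fed in through the Kafka Monitor; as a rough Python sketch of what that request contains (the topic name and field values below assume the default configuration and are placeholders):

```python
import json
from kafka import KafkaProducer

# Sketch only: assumes the Kafka Monitor is listening on the default
# "demo.incoming" topic and the standard crawl request schema.
producer = KafkaProducer(bootstrap_servers='localhost:9092',
                         value_serializer=lambda m: json.dumps(m).encode('utf-8'))

request = {
    "url": "http://example.com",  # seed URL (placeholder)
    "appid": "testapp",           # placeholder application id
    "crawlid": "abc123",          # placeholder crawl id
    "spiderid": "link",           # route the request to the link spider
    "maxdepth": 2,                # bound how far the crawl expands from the seed
}
producer.send('demo.incoming', request)
producer.flush()
```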

madisonb commented 7 years ago

If this doesn't answer your question please let us know, otherwise I am going to close this.