@arheys It appears to me that you are ingesting crawl requests faster than your cluster can process them. The normal flow is Requests In → Redis Queue → Requests Scraped, and in your case In > Out. You either need to scale up the number of spiders and machines in your cluster, decrease the rate at which you generate requests, or adjust the per-domain throttle levels on your spiders so they crawl at a speed that is reasonable for you.
In normal operation your Redis queue backlog should hover near 0, because you have enough spiders to keep up with incoming requests. If your crawl backlog keeps growing, I would suggest adjusting your settings accordingly, as sketched below.
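If throttling is the route you take, this is roughly where the knobs live. A minimal sketch of the crawler's localsettings.py, assuming scrapy-cluster's stock QUEUE_* throttle settings; the values here are placeholders to tune for your own domains:

```python
# localsettings.py (crawler) -- a sketch, not a drop-in config.
# Allow at most QUEUE_HITS requests per domain per QUEUE_WINDOW seconds.
QUEUE_HITS = 10
QUEUE_WINDOW = 60
# Spread those hits evenly across the window instead of bursting.
QUEUE_MODERATED = True
```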
You might also want to check out the Production Setup documentation for further thoughts on deployment strategies.
If this answers your question please close this ticket.
Thanks for the reply.
I've made some experiments but the Redis queue is still growing very fast. I'm using the default wandering spider with slight modifications, and Redis as the AWS ElastiCache service.
Here are my settings and spider.
wandering_spider.txt
settings.txt
I'm scraping all links across about 10 domains, and after roughly 1.2-1.3 million pages scraped, Redis runs out of memory. I ran 72 spiders simultaneously across 9 instances, each with its own IP.
Thanks in advance for the answer.
The wandering spider is just an example; are you sure you want to use it in production? If you want an on-demand scraping cluster, you should be using the link spider, not the wandering one.
Since the wandering spider runs indefinitely until it can't find any new links, any new requests that come into the cluster keep generating more links and may never go away. I think the behavior you are seeing makes sense given the wandering spider's implementation; the sketch below illustrates the pattern.
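To make the feedback loop concrete, here is an illustrative sketch of the wandering pattern (not the actual scrapy-cluster implementation): every parsed page yields another request, so on link-rich sites the queue never drains.

```python
import random
import scrapy

class WanderingSketchSpider(scrapy.Spider):
    """Sketch of wandering behavior: each response re-seeds the queue."""
    name = "wandering_sketch"
    start_urls = ["http://example.com"]  # placeholder seed URL

    def parse(self, response):
        links = response.css("a::attr(href)").getall()
        if links:
            # The crawl only stops when a page has no links at all.
            yield response.follow(random.choice(links), callback=self.parse)
```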
If I understand what you are trying to do, you should use the link spider instead.
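For example, a link spider crawl can be bounded at submission time. A sketch using the stock Kafka Monitor feed command, where the url, appid, and crawlid values are placeholders:

```
python kafka_monitor.py feed '{"url": "http://example.com", "appid": "testapp", "crawlid": "abc123", "spiderid": "link", "maxdepth": 2}'
```

With a maxdepth set, link expansion stops at that depth, so the queue and duplicate filter stay bounded instead of growing indefinitely.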
If this doesn't answer your question please let us know, otherwise I am going to close this.
Hello, I'm facing a problem using a Redis (AWS ElastiCache) instance with 27 GB of memory: after 1 million pages parsed, Redis is out of memory. How can I control the queue, like setting a TTL on records? My settings.py ... DUPEFILTER_TIMEOUT = 60 ... SCHEDULER_QUEUE_TIMEOUT = 60
Suggest a solution, please.
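For reference, one way to see which keys are actually consuming the memory is to tally the per-domain queue sizes. A sketch assuming scrapy-cluster's default "<spiderid>:<domain>:queue" key layout and the redis-py client; the endpoint is a placeholder:

```python
import redis

# Point this at your ElastiCache endpoint (placeholder host).
r = redis.Redis(host="your-elasticache-endpoint", port=6379)

# scrapy-cluster's per-domain queues are sorted sets, so ZCARD
# reports how many requests are backed up for each domain.
for key in r.scan_iter(match="*:queue"):
    print(key.decode(), r.zcard(key))
```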