istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

Crawl complete signal #193

Closed · wysiwygism closed this issue 6 years ago

wysiwygism commented 6 years ago

Hi,

Is there a proper way to know when a specific crawl (crawlid) is complete/finished? I tried querying for info ({action: "info"}), but there are no reliable parameters in the response to indicate that the website crawl was complete.

Thanks

madisonb commented 6 years ago

What constitutes a completed crawl? With web pages being highly dynamic, unless you know the exact number of pages with links ahead of time, it's difficult to tell when the crawl has actually stopped. There are a number of factors at play here.

This is also compounded by the fact that the crawler is a distributed system, and not all crawlers know or understand what the others are doing. Some mitigation strategies that are in this project:

  1. You can use the work done in this PR https://github.com/istresearch/scrapy-cluster/pull/165 to limit the number of pages for any given domain
  2. You can use the expires flag on your crawl request to stop the crawl after a certain period of time
  3. You can manually send a stop (scroll down) action to halt the crawl when you want
  4. You can send an info (scroll down to Crawl ID Information Request) action to get specific info about the crawl. If total_pending is 0, you know your crawl is finished (see the sketch after this list).
  5. You can limit the maximum depth of the crawl via the maxdepth parameter, so your crawl does not go too deep into the page(s) you are interested in.
  6. You can use deny_extensions, deny_regex, allow_regex, and allowed_domains to control what website links you crawl
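
For concreteness, the request payloads behind items 2 through 6 look roughly like the sketch below. The url, appid, crawlid, uuid, and spiderid values are placeholders, and the exact field names and formats should be checked against the Crawl API / Action API docs for your version:

```python
import json

# A crawl request that bounds the crawl up front (field names follow the
# scrapy-cluster Crawl API; verify them against the docs for your version).
crawl_request = {
    "url": "http://example.com",        # placeholder seed URL
    "appid": "myapp",                   # placeholder application id
    "crawlid": "mycrawl123",            # id you will later query or stop
    "maxdepth": 2,                      # item 5: cap the link-following depth
    "expires": 1489105783,              # item 2: time (format per your docs) when the crawl auto-stops
    "allowed_domains": ["example.com"], # item 6: stay on the target site
}

# An info request for that crawlid (Action API, "Crawl ID Information Request");
# the response includes total_pending, which should drain to 0 as the crawl finishes.
info_request = {
    "action": "info",
    "appid": "myapp",
    "uuid": "someuuid",                 # placeholder request uuid
    "crawlid": "mycrawl123",
    "spiderid": "link",
}

# A stop request (item 3) to halt the crawl manually.
stop_request = {
    "action": "stop",
    "appid": "myapp",
    "uuid": "someuuid2",
    "crawlid": "mycrawl123",
    "spiderid": "link",
}

# These payloads can be fed to the cluster, e.g. via the kafka monitor CLI:
#   python kafka_monitor.py feed '<json here>'
print(json.dumps(crawl_request))
```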

In typical use cases, a combination of the above ideas will help you tune your system so you know when the styles of crawl requests you generate are complete.

If this answers your question, please close the ticket!

wysiwygism commented 6 years ago

Thanks for the detailed answer.

I am trying to crawl a website. It is unknown to me how many pages I will discover, but I need everything I can discover, so I cannot limit the time or page count. Also, after the crawling process is done, my system starts another process that is dependent on the full crawl. So it would be great if we had some kind of definitive signal.

The best indication of the end of a crawl is indeed "total_pending" in the info action response. But there are some cases where total_pending is 0 and the crawl process is still running. I think it happens when the spider gets an error from hitting a URL.

Is there any plugin/extension style approach to this problem?

madisonb commented 6 years ago

The total_pending can be 0 if all the spiders are in the process of executing the crawl, but have not received the new html to download or have not generated new links in the case of maxdepth > 0. I would say this is common at lower crawl depths but less common at large crawl depths.

Presuming you don't want your crawl to run to infinity, can you put a hard stop at a certain timestamp? Or use multiple pieces of information to make your decision? For example: if (total_pages_crawled > 100 OR time_elapsed > 1000 secs) AND pending_pages == 0, then stop the crawl and presume it is complete. A rough sketch of that kind of polling check is below.
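
Something like the following untested sketch, for example. It assumes you already have a helper (send_action here is a placeholder) that feeds an action request to the kafka monitor and hands back the parsed info response for the crawlid; total_pending is the field discussed in this thread, while total_crawled is an assumed name for whatever page count your own pipeline tracks:

```python
import time

POLL_INTERVAL = 30      # seconds between info requests
MAX_RUNTIME = 1000      # hard stop after this many seconds
MIN_PAGES = 100         # "enough pages" threshold from the example above
STABLE_ZERO_POLLS = 3   # require several consecutive zero readings, since
                        # total_pending can briefly be 0 while spiders are busy

def wait_for_crawl_completion(send_action, crawlid):
    """Poll the cluster until the crawl looks finished, then stop it.

    `send_action` is a placeholder helper that feeds an action request
    (like the info_request shown earlier) to the kafka monitor and
    returns the parsed JSON response for this crawlid.
    """
    started = time.time()
    pages_crawled = 0
    zero_streak = 0

    while True:
        info = send_action({"action": "info", "crawlid": crawlid})
        pending = info.get("total_pending", 0)
        # Field name is an assumption; use however your pipeline counts pages.
        pages_crawled = info.get("total_crawled", pages_crawled)
        elapsed = time.time() - started

        zero_streak = zero_streak + 1 if pending == 0 else 0

        # (enough pages OR enough time) AND the queue has stayed empty
        if (pages_crawled > MIN_PAGES or elapsed > MAX_RUNTIME) \
                and zero_streak >= STABLE_ZERO_POLLS:
            send_action({"action": "stop", "crawlid": crawlid})
            return

        time.sleep(POLL_INTERVAL)
```

Requiring total_pending to stay at 0 for a few polls in a row also covers the window described above, where it briefly reads 0 while spiders are still mid-download.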

wysiwygism commented 6 years ago

Ok, thanks for the help.

RochdiBoudokhane commented 3 years ago

Hello, I am using scrapy-cluster to scrape specific items and crawl entire sites (~20 pages), but when I run the spider it only crawls 8, 9, or 10 pages. Do you have any suggestions on how to solve this? Thanks.