Closed wysiwygism closed 6 years ago
What constitutes a completed crawl? With webpages being highly dynamic, unless you know the exact number of pages with links ahead of time, it's difficult to tell when the crawl has actually stopped. There are a number of factors at play here.
This is also compounded by the fact that the crawler is a distributed system, and not all crawlers know or understand what the others are doing. Some mitigation strategies available in this project:
- Use the `expires` flag on your crawl request to stop the crawl after a certain period of time.
- Use the Crawl ID Information Request (`info` action) to get specific information about the crawl. If `total_pending` is 0, you know your crawl is finished.
- Use the `maxdepth` parameter, so your crawl does not go too deep into the page(s) you are interested in.
- Use `deny_extensions`, `deny_regex`, `allow_regex`, and `allowed_domains` to control which website links you crawl.

In typical use cases, a combination of the above ideas will help you tune your system to know when the styles of crawl requests you generate are complete.
If this answers your question, please close the ticket!
Thanks for the detailed answer.
I am trying to crawl a website. It's unknown to me how many pages I will discover, but I need everything I can discover, so I cannot limit by time or page count. Also, after the crawling process is done, my system starts another process that is dependent on the full crawl. So it would be great if we had some kind of definitive signal.
The best indication of the end of crawling is indeed `total_pending` in the `info` action response. But there are some cases where `total_pending` is 0 and the crawl process is still running. I think it happens when the spider gets an error from hitting a URL.
Is there any plugin/extension style approach to this problem?
The `total_pending` value can be 0 if all the spiders are in the process of executing the crawl, but have not yet received the new HTML to download or have not generated new links (in the case of `maxdepth` > 0). I would say this is common at lower crawl depths but less common at large crawl depths.
Presuming you don't want your crawl to run to infinity, can you put a hard stop at a certain timestamp? Or use multiple pieces of information to make your decision? For example, if `(total_pages_crawled > 100 OR time_elapsed > 1000 secs) AND pending_pages == 0`, then stop the crawl and presume it is complete.
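The combined stop condition suggested above can be sketched as a small predicate. The thresholds (100 pages, 1000 seconds) and the metric names are illustrative assumptions, not values Scrapy Cluster provides under these names; wire in whatever stats your monitoring exposes.

```python
import time

def should_stop(total_pages_crawled, start_time, pending_pages,
                min_pages=100, max_elapsed=1000):
    """Stop once enough work or time has accumulated AND nothing is pending.

    The AND with pending_pages == 0 guards against stopping while
    spiders are still mid-flight; the OR lets either signal qualify.
    Thresholds here are illustrative, not Scrapy Cluster defaults.
    """
    elapsed = time.time() - start_time
    return ((total_pages_crawled > min_pages or elapsed > max_elapsed)
            and pending_pages == 0)
```

Polling this predicate on a timer alongside periodic `info` requests gives a reasonably robust "crawl complete" signal without relying on `total_pending` alone.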
Ok, thanks for help.
Hello there, I am using scrapy-cluster to scrape specific items and scroll through the entire set of pages (~20 pages), but when I run the spider it only scrolls 8, 9, or 10 pages. Do you have any suggestions for how to solve this? Thanks
Hi,
Is there a proper way to know when a specific crawl (`crawlid`) is complete/finished? I tried querying for info with `{action: "info"}`, but there are no reliable parameters to indicate that the website crawl was complete. Thanks