martingg88 opened this issue 9 years ago
A site will have a finite number of pages. The crawler avoids cycles by keeping a set of previously visited URLs. For example, say link A references B. B is crawled, and it references A. A won't be recrawled, since its URL is already in the set of seen URLs.
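Roughly, the bookkeeping looks like this (an illustrative sketch, not roboto's actual code; `fetchPage` and `extractLinks` are stand-ins for whatever does the HTTP request and link extraction):

```js
// Sketch of cycle avoidance with a set of seen URLs.
const seen = new Set();
const queue = ['http://foo.com/'];

async function crawl(fetchPage, extractLinks) {
  while (queue.length > 0) {
    const url = queue.shift();
    if (seen.has(url)) continue; // A -> B -> A: A is skipped here
    seen.add(url);

    const page = await fetchPage(url);
    for (const link of extractLinks(page)) {
      if (!seen.has(link)) queue.push(link);
    }
  }
}
```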
In addition, all URLs are normalized before they are crawled and stored in the visited URLs set. This helps avoid duplicate page crawls. Here's an example:
http://foo.com/people?age=30&filter=joe&sort=up
https://foo.com/people?age=30&sort=up&filter=joe
In this case, the URLs differ, but in most cases they will produce the same response. You can read more about roboto's normalization routine here: https://github.com/jculvey/roboto#url-normalization
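For illustration only (the real routine is described at the link above), a normalization step along these lines would map both example URLs to the same key; treating http and https as equivalent is an assumption made here just to match the example:

```js
const { URL } = require('url');

// Lowercase the host, sort the query parameters, and collapse the scheme
// so both example URLs above normalize to the same string.
function normalizeUrl(raw) {
  const u = new URL(raw);
  u.protocol = 'http:';              // assumption: http and https are treated as the same page
  u.hostname = u.hostname.toLowerCase();
  u.searchParams.sort();             // age=30&filter=joe&sort=up in both cases
  return u.toString();
}
```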
One last question here: does it support a stop-and-resume feature?
Nope, not yet. Sorry :/
It's one of the things people have asked for. I'll look into adding it soon.
Would having something like Redis or SQLite as a dependency be an issue for you?
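To make the question concrete, here is a rough sketch of what resumable crawl state could look like with Redis as the backing store. The key names are made up for illustration and this is not a proposal for roboto's actual design:

```js
const redis = require('redis');
const client = redis.createClient();

// Visited set and pending queue live in Redis, so a crawl that is stopped
// can be resumed by another process picking up where the last one left off.
function markVisited(url, done) {
  client.sadd('roboto:visited', url, done);
}

function alreadyVisited(url, done) {
  client.sismember('roboto:visited', url, (err, n) => done(err, n === 1));
}

function enqueue(url, done) {
  client.rpush('roboto:queue', url, done);
}

function dequeue(done) {
  client.lpop('roboto:queue', done); // null when the queue is empty
}
```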
Great, thanks. How about a Waterline adapter, so developers can choose any database available in the Node.js ecosystem?
Here is the reference for the Waterline adapter.
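For the sake of discussion, the crawl-state storage could sit behind a tiny adapter interface like the sketch below; a Waterline-backed implementation of the same two methods would then let developers pick any database that has a Waterline adapter. Nothing here exists in roboto today.

```js
// Hypothetical storage interface for the visited-URL set.
function createMemoryStore() {
  const visited = new Set();
  return {
    has: async (url) => visited.has(url),
    add: async (url) => { visited.add(url); },
  };
}

// A Waterline- or Redis-backed store would implement has()/add() the same
// way, so the crawler itself never needs to know which database is in use.
```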
Will this cause an infinite crawl for bigger sites? What strategy can be used to crawl a website efficiently?
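For what it's worth, one common way to keep large crawls bounded is to combine the visited-set check described above with an explicit page budget and a same-domain filter. The sketch below is purely illustrative and not a roboto option:

```js
const { URL } = require('url');

const MAX_PAGES = 10000; // illustrative budget, not a roboto setting

// Crawl a URL only if it is unseen, within budget, and on the target site.
function shouldCrawl(url, seen) {
  return !seen.has(url)
    && seen.size < MAX_PAGES
    && new URL(url).hostname.endsWith('foo.com');
}
```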