jculvey / roboto

A web crawler/scraper/spider for nodejs

infinite crawl #12


martingg88 commented 9 years ago

Will this cause an infinite crawl on bigger sites? What strategy can be used to crawl a website efficiently?

jculvey commented 9 years ago

A site will have a finite number of pages. The crawler avoids cycles by keeping a set of previously visited URLs. For example, suppose page A links to B. B is crawled, and it links back to A. A won't be recrawled, since its URL is already in the set of seen URLs.
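The idea looks roughly like this (a minimal sketch of the visited-set approach, not roboto's actual internals; the `links` map stands in for fetching a page and extracting its anchors):

```js
// Stand-in for "fetch page, extract links": A and B link to each other.
const links = {
  'http://foo.com/a': ['http://foo.com/b'],
  'http://foo.com/b': ['http://foo.com/a'], // B links back to A
};

const seen = new Set();
const queue = ['http://foo.com/a'];

while (queue.length > 0) {
  const url = queue.shift();
  if (seen.has(url)) continue; // already crawled: breaks the A <-> B cycle
  seen.add(url);
  console.log('crawling', url);
  queue.push(...(links[url] || []));
}
```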

In addition, all URLs are normalized before they're crawled and stored in the visited set. This helps avoid duplicate page crawls. Here's an example:

http://foo.com/people?age=30&filter=joe&sort=up
https://foo.com/people?age=30&sort=up&filter=joe

In this case, the URLs differ, but in most cases they will produce the same response. You can read more about roboto's normalization routine here: https://github.com/jculvey/roboto#url-normalization
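As a rough sketch of that kind of normalization (this is not roboto's exact routine; in particular, collapsing https to http is an assumption for illustration):

```js
// Canonicalize a URL so equivalent variants hash to the same string.
function normalize(raw) {
  const u = new URL(raw);
  u.searchParams.sort();  // canonical query-parameter order
  u.protocol = 'http:';   // assumption: treat http/https as the same page
  return u.toString();
}

console.log(
  normalize('http://foo.com/people?age=30&filter=joe&sort=up') ===
  normalize('https://foo.com/people?age=30&sort=up&filter=joe')
); // true
```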

martingg88 commented 9 years ago

One last question here: does it support a stop/resume feature?

jculvey commented 9 years ago

Nope, not yet. Sorry :/

It's one of the things people have asked for. I'll look into adding it soon.

Would having something like redis or sqlite as a dependency be an issue for you?
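For a sense of what stop/resume could look like if redis were the backing store (purely a hypothetical sketch of the idea, not a roboto API; the `frontier` and `visited` key names are made up):

```js
// Hypothetical redis-backed crawl state, using the `redis` npm package (v4 API).
// Because the frontier and visited set live in redis, the process can be
// killed and restarted and the crawl picks up where it left off.
const { createClient } = require('redis');

async function crawl() {
  const db = createClient();
  await db.connect();

  // Seed the frontier only on a fresh crawl; otherwise resume as-is.
  if (await db.lLen('frontier') === 0 && await db.sCard('visited') === 0) {
    await db.lPush('frontier', 'http://foo.com/');
  }

  let url;
  while ((url = await db.rPop('frontier')) !== null) {
    if (await db.sIsMember('visited', url)) continue;
    await db.sAdd('visited', url);
    // fetch + parse here; push discovered links with db.lPush('frontier', link)
  }
  await db.quit();
}

crawl().catch(console.error);
```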

martingg88 commented 9 years ago

Great, thanks. How about a Waterline adapter, so developers can choose any database available in the Node.js ecosystem?

Here is the reference for Waterline:

https://github.com/balderdashy/waterline
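For illustration, a pluggable visited-URL store built on Waterline might look roughly like this (a hypothetical sketch against Waterline's 0.10-era API, not anything roboto ships; `sails-disk` is just one adapter choice, swappable for `sails-mysql`, `sails-mongo`, etc.):

```js
const Waterline = require('waterline');
const sailsDisk = require('sails-disk');

const orm = new Waterline();

// Model for the visited-URL set; only the adapter config below would
// change to target a different database.
const Visited = Waterline.Collection.extend({
  identity: 'visited',
  connection: 'default',
  attributes: {
    url: { type: 'string', unique: true },
  },
});

orm.loadCollection(Visited);

orm.initialize({
  adapters: { disk: sailsDisk },
  connections: { default: { adapter: 'disk' } },
}, (err, models) => {
  if (err) throw err;
  const visited = models.collections.visited;
  // e.g. visited.findOrCreate({ url: someUrl }) before crawling a page
});
```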