stewartmckee / cobweb

Web crawler with very flexible crawling options. Can be used standalone or with Resque to perform clustered crawls.
MIT License

Would it be possible to add "depth" to the data hash? #28

Open ABrisset opened 10 years ago

ABrisset commented 10 years ago

Hello,

As far as I can see, the generated hash for each page doesn't include the "depth" information, that is to say how many clicks each page is from the homepage. Do you think it would be possible to add this option to the hash? By the way, I really appreciate your gem, good work Stewart!

Thanks.

stewartmckee commented 9 years ago

I'm assuming you mean minimum depth. One of the misconceptions about navigation is that there is only one way to reach a page; the depth of a page can differ depending on the route you take to get to it. Also, what is the homepage? Is it the page you started the crawl from, or the page with the shortest url?

If we took it as the first page crawled and passed a depth counter down with the crawl, the results would not be guaranteed to be accurate: each page is only processed once, so a page that is linked to from the homepage (depth 1) but happens to be reached first via a sub-page would be recorded with a depth of 2.

It's something to think about. I suppose if you specified a page as the root and then, after completion, processed all crawled pages for the shortest route (we have the data for that), it would give the most accurate results. But again, html navigation is not a tree structure; it's a node graph with multiple parents and interconnections.
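For illustration, a minimal sketch of that post-processing idea, assuming the crawl has already stored, for each page, the urls it links to (the `links` hash below is a hypothetical structure; you'd build it from whatever your crawl persists). A breadth-first pass from the root yields the minimum depth for every reachable page:

```ruby
# links: hash mapping each crawled url to the array of urls it links to.
# Returns a hash of url => minimum click depth from the given root.
def minimum_depths(root, links)
  depths = { root => 0 }
  queue  = [root]
  until queue.empty?
    url = queue.shift
    (links[url] || []).each do |child|
      next if depths.key?(child) # already reached via a route at least as short
      depths[child] = depths[url] + 1
      queue << child
    end
  end
  depths
end
```

Because breadth-first search visits pages in order of increasing distance, the first route that reaches a page is guaranteed to be a shortest one, which is exactly the "minimum depth" being discussed.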

nikhgupta commented 9 years ago

That's correct, and it would be inaccurate to report depth while processing the content. However, is there a way we can limit the crawl to a certain depth?

Let's say we start from the seed url and we only want to go 2 pages deep within the navigation. Is that possible with Cobweb? This is certainly possible with the Anemone crawler, but that is an old gem now. I love the way Cobweb uses Sidekiq/Resque jobs, and would really prefer to be able to limit the crawl depth.
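A depth cap is simpler to enforce than accurate depth reporting, because each queued job only needs to carry its own discovery depth, not the whole link graph. A rough sketch of the idea (not Cobweb's actual API; `CrawlJob` here is a hypothetical Resque job class):

```ruby
require 'resque'

# Each queued job carries the depth at which its url was discovered;
# links found on a page at the cap are simply not enqueued.
MAX_DEPTH = 2

def enqueue_links(links, depth)
  return if depth >= MAX_DEPTH # stop following links any deeper
  links.each do |link|
    Resque.enqueue(CrawlJob, 'url' => link, 'depth' => depth + 1)
  end
end
```

Note that stewartmckee's caveat still applies: this records discovery depth, not minimum depth, so a page first reached at depth 2 that is also linked from depth 1 may have its children cut off even though they are within the limit.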

By the way, thanks again for the awesome gem. Really useful.

colnpanic commented 9 years ago

I agree on both points: this is a really cool gem :+1: and a "max_depth" option would be great. I totally understand that we're not dealing with tree data and that "depth" is relative, but it would still be useful. It would give you a chance to quickly test the "core" links from a page, following just a couple of levels without processing the entire site, so you can preview some results without waiting for the whole site to finish.
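If such an option existed, usage might look like the sketch below. To be clear, `:max_depth` is hypothetical and not an existing Cobweb option; the `CobwebCrawler` call style follows the gem's standalone interface:

```ruby
# Hypothetical: :max_depth is not currently supported by Cobweb.
crawler = CobwebCrawler.new(:cache => 600, :max_depth => 2)
crawler.crawl("http://example.com") do |page|
  puts "previewed: #{page[:url]}"
end
```

That would cover the quick-preview case described above: crawl the seed page and its immediate neighbourhood, then stop.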