sjparsons closed this issue 12 years ago
Actually, yes, this is a great idea. A long time ago, when I originally wrote this, I had a separate table for links that were crawled, but I removed it because it was redundant and pretty much doubled the size of the db, which I thought was bad. I thought I had some logic in there to keep it from making redundant requests, but either it doesn't work or I never implemented it like I thought. So yeah, this is a good idea, go for it.
Closing old issues after completely rewriting the program
I'm wondering next about adding some functionality to reduce the number of requests to the same pages.

For example, imagine the following site. When `index.php` is parsed, both `page1.php` and `page2.php` are added to the queue to get crawled. When `page1.php` gets crawled, `page2.php` is again added to the queue to get crawled. The result is that `page2.php` will actually get crawled twice.
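For illustration, here's a minimal sketch of that behavior (Python, with a hard-coded link map standing in for the real site; none of this is the actual crawler code):

```python
from collections import deque

# Assumed current behavior: every discovered link is queued,
# with no check for URLs that have already been crawled.
links = {
    "index.php": ["page1.php", "page2.php"],
    "page1.php": ["page2.php"],
    "page2.php": [],
}

queue = deque(["index.php"])
while queue:
    url = queue.popleft()
    print("crawling", url)    # "crawling page2.php" is printed twice
    queue.extend(links[url])  # re-queues page2.php when page1.php is parsed
```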
The benefit of how things are set up currently is that the `crawl_index` table is left with a complete record of the links to and from any given page. This is a nice feature and I wouldn't want to lose it. However, it would be nice to reduce the redundant crawling. (On a big site that I'm working on, this might mean that I'll actually get through the site!)

What I'm thinking of is setting up the crawler to check whether a record for the URL already exists in `crawl_index` whenever it grabs a new URL from the `queue`. If a record does exist, the crawler would simply add the URL with the correct `crawlid` and `parentid` but the already discovered `title`, `keyword`, and `status`. The crawler would neither re-get nor re-parse the current URL.

Any thoughts on this before I work up a possible implementation?
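For concreteness, here's a minimal sketch of that lookup-and-copy step (Python with SQLite; the table and column names come from this issue, but the schema details and the `fetch_parse_and_record` helper are assumptions, not the project's actual code):

```python
import sqlite3

def process_queued_url(db, url, crawlid, parentid):
    cur = db.cursor()
    # Has this URL already been crawled? (Assumed crawl_index schema:
    # url, crawlid, parentid, title, keyword, status.)
    cur.execute(
        "SELECT title, keyword, status FROM crawl_index WHERE url = ? LIMIT 1",
        (url,),
    )
    row = cur.fetchone()
    if row is not None:
        # Already seen: record the link with the current crawlid/parentid,
        # reuse the previously discovered title, keyword, and status, and
        # skip the HTTP request and parsing entirely.
        title, keyword, status = row
        cur.execute(
            "INSERT INTO crawl_index"
            " (url, crawlid, parentid, title, keyword, status)"
            " VALUES (?, ?, ?, ?, ?, ?)",
            (url, crawlid, parentid, title, keyword, status),
        )
        db.commit()
        return
    # New URL: fetch, parse, and queue its links as the crawler does now.
    fetch_parse_and_record(db, url, crawlid, parentid)  # hypothetical existing step
```

One detail to decide: as written, the `SELECT` matches a record from any crawl, so the copied `title`/`status` could be stale; filtering by the current `crawlid` instead would only skip duplicates within a single crawl while still re-fetching pages on later crawls.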