theanti9 / PyCrawler

A python web crawler

Reducing Redundancy #6

Closed sjparsons closed 12 years ago

sjparsons commented 13 years ago

I'm wondering next about adding some functionality to reduce the number of requests to the same pages.

For example, imagine the following site.

index.php
-> page1.php
    -> page2.php
-> page2.php

When index.php is parsed, both page1.php and page2.php are added to the queue to be crawled. When page1.php is crawled, page2.php is added to the queue again. The result is that page2.php actually gets crawled twice.

The benefit of how things are set up currently is that the crawl_index table ends up with a complete record of the links to and from any given page. This is a nice feature and I wouldn't want to lose it. However, it would be nice to reduce the redundant crawling. (On a big site that I'm working on, this might mean that I'll actually get through the site!)

What I'm thinking of is setting up the crawler to check whether a record for a URL already exists in crawl_index whenever it grabs a new URL from the queue. If a record already exists, the crawler would simply add the URL with the correct crawlid and parentid but the already-discovered title, keywords, and status. The crawler would neither re-fetch nor re-parse that URL.
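To make the idea concrete, here is a minimal sketch of what that check could look like. The column names (url, title, keywords, status, crawlid, parentid), the sqlite3 backend, and the enqueue_or_copy helper are all assumptions for illustration, not PyCrawler's actual schema or API.

```python
import sqlite3

def enqueue_or_copy(db_path, url, crawlid, parentid, queue):
    """Queue a URL for crawling, or copy its existing crawl_index record.

    Hypothetical helper: column and table names are assumed, not taken
    from PyCrawler's real schema.
    """
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute(
        "SELECT title, keywords, status FROM crawl_index WHERE url = ? LIMIT 1",
        (url,),
    )
    row = cur.fetchone()
    if row is None:
        # URL not seen before: queue it for a real fetch and parse.
        queue.append((url, crawlid, parentid))
    else:
        # URL already crawled: record the link relationship using the
        # previously discovered metadata, but skip the HTTP request.
        title, keywords, status = row
        cur.execute(
            "INSERT INTO crawl_index "
            "(crawlid, parentid, url, title, keywords, status) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (crawlid, parentid, url, title, keywords, status),
        )
        conn.commit()
    conn.close()
```

This way the link graph stays complete (every parent/child relationship still gets a row), while each distinct URL is only fetched and parsed once per crawl.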

Any thoughts on this before I work up a possible implementation?

theanti9 commented 13 years ago

Actually, yes, this is a great idea. A long time ago, when I originally wrote this, I had a separate table for links that were crawled, but I removed it because it was redundant and pretty much doubled the size of the db, which I thought was bad. I thought I had some logic in there to keep it from doing redundant requests, but either it doesn't work or I never implemented it like I thought. So yeah, this is a good idea, go for it.

theanti9 commented 12 years ago

Closing old issues after completely rewriting the program.