Implement policies governing the behaviour of the spider. These must be able to interface with the planned Python scripting; a rough sketch of one possible Python-side interface follows the policy list below.
Policies:
Selection policy - Which sites to download from.
Re-visit policy - If the crawl runs over a long period of time, will it re-check earlier URLs to see whether they have been updated?
Politeness policy - How often to crawl a given site, whether to honour robots.txt, etc.
Parallelization policy - How does the crawler make use of the available threads?
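A minimal sketch of how the four policies could be exposed to user scripts, assuming the spider calls user-defined hooks at each decision point. The class and method names (SpiderPolicy, should_visit, revisit_after, crawl_delay, allowed_by_robots, max_workers) are placeholders for illustration, not a settled API; only the robots.txt handling uses the real standard-library parser.

```python
# Hypothetical policy interface the spider might expose to Python scripts.
# All hook names here are assumptions, not a finalised design.
import urllib.robotparser
from urllib.parse import urlparse


class SpiderPolicy:
    """Hooks the crawler would call at each decision point."""

    # Selection policy: decide whether a discovered URL is worth downloading.
    def should_visit(self, url: str) -> bool:
        return urlparse(url).scheme in ("http", "https")

    # Re-visit policy: seconds to wait before re-checking a URL for updates,
    # or None to never re-visit it.
    def revisit_after(self, url: str):
        return 24 * 60 * 60  # re-check roughly once a day

    # Politeness policy: minimum delay between requests to the same host.
    def crawl_delay(self, host: str) -> float:
        return 1.0

    # Politeness policy: consult the site's robots.txt via the stdlib parser.
    def allowed_by_robots(self, url: str, user_agent: str = "spider") -> bool:
        parsed = urlparse(url)
        parser = urllib.robotparser.RobotFileParser()
        parser.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
        try:
            parser.read()
        except OSError:
            return True  # treat an unreachable robots.txt as permissive
        return parser.can_fetch(user_agent, url)

    # Parallelization policy: how many worker threads to run against a host.
    def max_workers(self, host: str) -> int:
        return 2


if __name__ == "__main__":
    policy = SpiderPolicy()
    print(policy.should_visit("https://example.com/page"))
    print(policy.crawl_delay("example.com"))
```

Under this shape, a user script would subclass SpiderPolicy and override whichever hooks it cares about, leaving the rest at their defaults.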