Closed GoogleCodeExporter closed 9 years ago
I've read my description again, and perhaps it is not entirely clear what the
problem is. Perhaps I lack the necessary Java knowledge, but perhaps this would
actually be an enhancement.
The Controller starts the Crawler with:
CrawlController controller = new CrawlController("/data/crawl/root");
controller.addSeed("http://www.ics.uci.edu/");
controller.start(MyCrawler.class, 10);
But how can I stop the crawler manually?
I would like to stop it depending on:
- The number of crawled pages for a seed (e.g. max. 2000 pages)
- The maximum number of URLs having the same content (this is what I described
above)
I did not find a way to do this within shouldVisit or visit, so I have to
"hope" that the crawl process comes to an end.
Original comment by andreas....@googlemail.com
on 16 Apr 2010 at 8:44
You can keep a static counter next to your shouldVisit function and have the
crawler threads access it through a synchronized function. Something like this:
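// Shared by all crawler threads; counts the calls made so far.
private static int count = 0;

// Returns true once more than 2000 calls have been counted.
private static synchronized boolean shouldStop() {
    count++;
    return count > 2000;
}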
Then you can call this function from your shouldVisit function. Detecting
similar content is the responsibility of a module above the crawler; the
crawler is only responsible for fetching content. So, for example, you can use
hashing methods to detect similar content and stop following its links.
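The two limits discussed above could be combined in one small helper. This is only a sketch under assumptions: CrawlLimits and its method names are hypothetical and not part of crawler4j, SHA-256 stands in for "hashing methods", and only byte-identical content is treated as a duplicate.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper (not a crawler4j class): tracks a page budget and
// how often identical page content has been seen. Safe to share between threads.
final class CrawlLimits {
    private final int maxPages;
    private final int maxDuplicates;
    private final AtomicInteger pages = new AtomicInteger();
    private final ConcurrentMap<String, AtomicInteger> seen = new ConcurrentHashMap<>();

    CrawlLimits(int maxPages, int maxDuplicates) {
        this.maxPages = maxPages;
        this.maxDuplicates = maxDuplicates;
    }

    // Counts one more page; returns false once the budget is exhausted.
    boolean tryCountPage() {
        return pages.incrementAndGet() <= maxPages;
    }

    // Records the content and returns true if it has now been seen
    // more than maxDuplicates times.
    boolean isOverDuplicateLimit(String content) {
        String key = sha256(content);
        int n = seen.computeIfAbsent(key, k -> new AtomicInteger()).incrementAndGet();
        return n > maxDuplicates;
    }

    // Hashes the content so the map stores short keys instead of whole pages.
    private static String sha256(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }
}
```

A crawler could hold one shared instance (or one per seed, for a per-seed limit) and return false from shouldVisit once tryCountPage() returns false. Where the duplicate check belongs depends on where the page content is available in your code.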
Original comment by ganjisaffar@gmail.com
on 16 Apr 2010 at 9:22
Original issue reported on code.google.com by
andreas....@googlemail.com
on 15 Apr 2010 at 12:36