mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

crawl to infinity #6

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Create a PHP-Script with the following content:

<?php
if(!isset($_SERVER["QUERY_STRING"]))
    $_SERVER["QUERY_STRING"] = "";
$link = $_SERVER["PHP_SELF"]."?".$_SERVER["QUERY_STRING"]."&test=test";
$link = str_replace("?&", "?", $link);
?>
<a href="<?php echo $link;?>">test</a>

2. Run your Crawler against this page
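The script above produces a link that keeps appending `&test=test` to its own query string, so the crawler sees an endless chain of "new" URLs. One way to guard against such a self-appending link inside shouldVisit could look like the following sketch. The class and helper name are made up for illustration; only the URL-string check is shown, not the crawler4j integration:

```java
import java.util.Arrays;

public class TrapGuard {
    // Illustrative threshold: a normal page rarely carries this many parameters.
    static final int MAX_QUERY_PARAMS = 10;

    // Returns true if the URL looks like a crawler trap: either the query
    // string has grown suspiciously long, or the same parameter name is
    // repeated (as in ?test=test&test=test&... from the PHP script above).
    public static boolean looksLikeTrap(String url) {
        int q = url.indexOf('?');
        if (q < 0) {
            return false;
        }
        String[] params = url.substring(q + 1).split("&");
        if (params.length > MAX_QUERY_PARAMS) {
            return true;
        }
        long distinctNames = Arrays.stream(params)
                .map(p -> p.split("=", 2)[0])
                .distinct()
                .count();
        // A repeated parameter name is a strong hint of a self-appending link.
        return distinctNames < params.length;
    }
}
```

In shouldVisit you would then reject any URL for which `looksLikeTrap` returns true, in addition to your usual checks.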

What is the expected output? What do you see instead?
Expected: An elegant way to prevent this behaviour.
Instead: shouldVisit only receives a URL object to perform checks on.

What version of the product are you using? On what operating system?
1.8.1, Windows XP

Please provide any additional information below.
Perhaps there is an elegant way I am not aware of...
Also: I do not see a way to change this issue to an enhancement.

Original issue reported on code.google.com by andreas....@googlemail.com on 15 Apr 2010 at 12:36

GoogleCodeExporter commented 9 years ago
I've read my description again, and perhaps it is not entirely clear what the
problem is. Perhaps I lack the necessary Java knowledge, but perhaps this
actually would be an enhancement.

The Controller starts the Crawler with:
CrawlController controller = new CrawlController("/data/crawl/root");
controller.addSeed("http://www.ics.uci.edu/");
controller.start(MyCrawler.class, 10);  

But how can the crawler be stopped manually?
I would like to stop it depending on:
- the number of crawled pages for a seed (e.g. max. 2000 pages)
- the maximum number of URLs having the same content (this is what I described above)

I did not find a way to do this within shouldVisit or visit, so I have to
"hope" that the crawl process comes to an end.

Original comment by andreas....@googlemail.com on 16 Apr 2010 at 8:44

GoogleCodeExporter commented 9 years ago
You can keep a static counter that the crawler threads access through a
synchronized function and check it in your shouldVisit function. Something
like this:

private static int count = 0;

// synchronized: crawler4j runs several crawler threads in parallel,
// so the shared counter must be incremented atomically.
private static synchronized boolean shouldContinue() {
    count++;
    return count <= 2000;
}

Then you can call this function from your shouldVisit function. Regarding
detection of similar content: that is the responsibility of a module above the
crawler; the crawler itself is only responsible for fetching content. For
example, you can use hashing to detect duplicate content and stop following
its links.

Original comment by ganjisaffar@gmail.com on 16 Apr 2010 at 9:22