Closed by GoogleCodeExporter 9 years ago
A follow-up (although I still need help...):
I added a
private String domainName = null;
field to my WebCrawler subclass and, before shouldVisit, track (and save) the
first domain name that is parsed. This works OK for a single crawler instance,
but it misses the multi-threading point: each crawler instance has its own copy
of the field. If I add more instances and the first link passed to another
thread is on a different domain, that thread will treat the new (different)
domain as the valid/acceptable one. So each thread may end up accepting pages
only from the wrong domain.
Any views?
Thanks.
Original comment by georg...@thoughtified.com
on 10 Jan 2011 at 4:21
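[Editor's note] One way around the per-instance field problem described above is to share a single thread-safe whitelist of hosts across all crawler threads, populated once from the seeds before the crawl starts. The sketch below is illustrative only and assumes nothing about the crawler4j API: the class name DomainWhitelist and its methods are hypothetical helpers you would call from your own shouldVisit override.

```java
import java.net.URI;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper (not part of crawler4j): a whitelist of allowed hosts,
// populated once from the seed URLs before the crawl starts, and read safely
// by every crawler thread via a concurrent set.
public class DomainWhitelist {

    // Thread-safe set shared by all crawler instances.
    private static final Set<String> allowedHosts = ConcurrentHashMap.newKeySet();

    // Call once per seed URL, before starting the crawlers.
    public static void addSeedHost(String seedUrl) {
        String host = hostOf(seedUrl);
        if (host != null) {
            allowedHosts.add(host);
        }
    }

    // What a shouldVisit override could delegate to.
    public static boolean isAllowed(String url) {
        String host = hostOf(url);
        return host != null && allowedHosts.contains(host);
    }

    // Extract the lower-cased host part of a URL, or null if malformed.
    private static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }
}
```

Because the set is static and concurrent, it does not matter which thread sees which link first: all threads consult the same list of hosts derived from the seeds, not from whatever page happened to be parsed first.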
I guess the solution will be to have an array of the various domain names in a
static field or enum class; that way, inside the shouldVisit function, you can
check against those, like this:
public boolean shouldVisit(WebURL url) {
    String href = url.getURL().toLowerCase();
    if (filters.matcher(href).matches()) {
        return false;
    }
    // iterate over the known domains
    for (String domain : domains) {
        if (href.startsWith(domain)) {
            return true;
        }
    }
    return false;
}
// At least you get the idea of what I mean.
Original comment by mambena...@gmail.com
on 26 Jan 2011 at 5:40
That's very helpful indeed! Thanks.
Original comment by georg...@thoughtified.com
on 27 Jan 2011 at 8:22
Hello, thank you very much for this great tool. Concerning the comment made by
user mambena... on 26 Jan 2011: I guess this means that one needs to know in
advance which domains are to be crawled. Is there a way to restrict which
domains will be crawled without knowing them in advance? For example, I am
reading a list of company names from a text file and searching for each in
turn with the Google "I'm Feeling Lucky" option. E.g. if I have "BBC, London"
in the text file, I create the following URL in the Controller class and use
it as the seed:
controller.addSeed("http://www.google.com/search?hl=de&source=hp&q=BBC+London&btnI=I%27m+Feeling+Lucky&aq=f&aqi=g10&aql=&oq=&gs_rfai=")
I then want the crawler to stay on the BBC home page and crawl just that. I
could add the BBC domain to an array as suggested above, or hardcode it in
shouldVisit, but my input file contains several thousand company names, and I
cannot hardcode each one by hand. Is there a way to achieve this?
Many thanks in advance for any help that anyone can offer!
Original comment by andrew.h...@qbis.ch
on 7 Feb 2011 at 11:45
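[Editor's note] For the use case above, the allowed host does not have to be known in advance: it can be recorded at runtime from each seed's landing URL. The sketch below is illustrative only, not crawler4j API; it assumes you first resolve the "I'm Feeling Lucky" redirect to the landing URL (e.g. with an HTTP client, not shown here), and the class name SeedHostFilter and its methods are hypothetical. Stripping a "www." prefix is a deliberately naive approximation of "same site".

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper (not part of crawler4j): record the host of each
// resolved seed at runtime, then accept only URLs whose host matches one
// of the recorded seed hosts.
public class SeedHostFilter {

    // normalized seed host -> the original seed URL it came from
    private static final Map<String, String> seedHosts = new ConcurrentHashMap<>();

    // Register the landing URL of a seed, i.e. the URL reached after
    // following any redirect (resolving the redirect is outside this sketch).
    public static void registerSeed(String landingUrl) {
        String host = normalizedHost(landingUrl);
        if (host != null) {
            seedHosts.putIfAbsent(host, landingUrl);
        }
    }

    // What a shouldVisit override could delegate to.
    public static boolean shouldVisit(String url) {
        String host = normalizedHost(url);
        return host != null && seedHosts.containsKey(host);
    }

    // Lower-case the host and strip a leading "www." so that
    // www.bbc.co.uk and bbc.co.uk compare equal.
    private static String normalizedHost(String url) {
        try {
            String host = URI.create(url).getHost();
            if (host == null) {
                return null;
            }
            host = host.toLowerCase();
            return host.startsWith("www.") ? host.substring(4) : host;
        } catch (IllegalArgumentException e) {
            return null;
        }
    }
}
```

With this approach, several thousand seeds from the text file can each contribute their own host to the filter, with no hardcoding per company.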
A more flexible approach is described here:
http://code.google.com/p/crawler4j/issues/detail?id=94#c1
Original comment by tahs...@trademango.com
on 10 Jan 2012 at 7:38
How can I eliminate the error at
CrawlController controller = new CrawlController("/data/crawl/root");
in Controller.java in crawler4j?
Original comment by priyanka...@gmail.com
on 9 Oct 2012 at 3:21
Not a bug or feature request
Original comment by avrah...@gmail.com
on 11 Aug 2014 at 12:44
Original issue reported on code.google.com by
georg...@thoughtified.com
on 10 Jan 2011 at 1:27