mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

How to force crawler4j to stay within initial domain #26

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I'm hoping that someone can help me with this:

I am trying to force the crawler to crawl pages only within a single domain.

e.g. I will be passing several domains to the crawlers (as individual
executions) and would like each crawler to stop only after it has followed
every link on a domain's pages that points to another page within that same domain.

I haven't managed to "pass" the domain name of the first DocId (the first page
being crawled, i.e. the starting domain) on to the shouldVisit function so
that links to any other domain are discarded...

Can anyone help with this?

I also tried adding

    if (shouldVisit(cur) && canonicalUrl.startsWith(domainName)) {
        cur.setParentDocid(docid);
        toSchedule.add(cur);
    }

within preProcessPage (inside WebCrawler.java), but I can't pass the
"domainName" to this method, as it's invoked from "run", and the thread won't
take arguments...
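For reference, the controller instantiates the crawler class itself, which is
why there is nowhere to hand each instance its starting domain. A minimal
sketch of the usual setup (MyCrawler being one's own WebCrawler subclass):

    import edu.uci.ics.crawler4j.crawler.CrawlController;

    public class Controller {
        public static void main(String[] args) throws Exception {
            CrawlController controller = new CrawlController("/data/crawl/root");
            controller.addSeed("http://www.example.com/");
            // The controller constructs the MyCrawler instances itself
            // (one per thread), so their constructors cannot take
            // per-crawl arguments such as the starting domain.
            controller.start(MyCrawler.class, 10);
        }
    }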

Original issue reported on code.google.com by georg...@thoughtified.com on 10 Jan 2011 at 1:27

GoogleCodeExporter commented 9 years ago
A follow-up (although I still need help...):

I added a
    private String domainName = null;
field to my class that extends WebCrawler (just before shouldVisit) and track
(and save) the first domain name that is parsed. This works OK for a single
crawler instance, but it misses the multi-threading point: if I add more
instances and the first link handed to another thread is on a different
domain, that thread will treat the new (different) domain as the
valid/acceptable one. So each thread may end up accepting pages only from the
wrong domain.
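A sketch of that per-instance approach, and of why it breaks with several
threads (each instance latches onto whichever URL it happens to see first;
import paths as in current crawler4j):

    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class MyCrawler extends WebCrawler {
        private String domainName = null;  // per-instance, i.e. per-thread

        @Override
        public boolean shouldVisit(WebURL url) {
            String href = url.getURL().toLowerCase();
            if (domainName == null) {
                // Latches onto the first URL this thread happens to see,
                // which need not be on the intended seed domain.
                domainName = href;
            }
            return href.startsWith(domainName);
        }
    }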

Any views?

Thanks.

Original comment by georg...@thoughtified.com on 10 Jan 2011 at 4:21

GoogleCodeExporter commented 9 years ago
I guess the solution would be to keep an array of the various domain names in
a static or enum class; that way, inside the shouldVisit function, you can
check against those,

like this:

    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        if (filters.matcher(href).matches()) {
            return false;
        }
        for (String domain : Domains.getArray()) {
            if (href.startsWith(domain)) {
                return true;
            }
        }
        return false;
    }

// at least you get the idea of what I mean
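A minimal holder for that domain list might look like the following (Domains
is a hypothetical class, to be populated in the controller before the crawl
starts, e.g. Domains.add("http://www.example.com/") for each seed):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical static holder for the allowed domain prefixes.
    public class Domains {
        private static final List<String> DOMAINS = new ArrayList<String>();

        public static synchronized void add(String domainPrefix) {
            DOMAINS.add(domainPrefix.toLowerCase());
        }

        // Returns a snapshot so callers can iterate without locking.
        public static synchronized List<String> getArray() {
            return new ArrayList<String>(DOMAINS);
        }
    }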

Original comment by mambena...@gmail.com on 26 Jan 2011 at 5:40

GoogleCodeExporter commented 9 years ago
That's very helpful indeed! Thanks.

Original comment by georg...@thoughtified.com on 27 Jan 2011 at 8:22

GoogleCodeExporter commented 9 years ago
Hello, thank you very much for this great tool. Concerning the comment made by
user mambena... on 26 Jan 2011: I guess this means that one needs to know in
advance which domains are to be crawled. Is there a way to restrict which
domains will be crawled without knowing them in advance? For example, I am
reading a list of company names from a text file and searching for each in
turn with the Google "I'm Feeling Lucky" option. E.g. if I have "BBC, London"
in the text file, I create the following URL in the Controller class and use
it as the seed:
                    controller.addSeed("http://www.google.com/search?hl=de&source=hp&q=BBC+London&btnI=I%27m+Feeling+Lucky&aq=f&aqi=g10&aql=&oq=&gs_rfai=")

I then want the crawler to stay on the BBC home page and crawl just that. I
could add the BBC domain to an array as suggested above, or hardcode it in
shouldVisit, but my input file will contain several thousand company names,
and I cannot hardcode each one manually. Is there a way to achieve this?
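One way to avoid hardcoding would be to resolve each "I'm Feeling Lucky"
redirect up front and derive the allowed domain from the final URL before
seeding. A rough sketch using plain java.net, assuming Google answers with an
ordinary 302 redirect (it may also insist on a browser-like User-Agent):

    import java.net.HttpURLConnection;
    import java.net.URL;

    public static String resolveLuckyUrl(String luckyUrl) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(luckyUrl).openConnection();
        conn.setInstanceFollowRedirects(false);  // we want the Location header itself
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        conn.connect();
        String target = conn.getHeaderField("Location");  // e.g. "http://www.bbc.co.uk/"
        conn.disconnect();
        return target;
    }

The resolved URL could then be passed to controller.addSeed(...) and its
domain added to the Domains list from the earlier comment.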

Many thanks in advance for any help that anyone can offer!

Original comment by andrew.h...@qbis.ch on 7 Feb 2011 at 11:45

GoogleCodeExporter commented 9 years ago
A more flexible approach is mentioned here:
http://code.google.com/p/crawler4j/issues/detail?id=94#c1
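For readers without access to that link, one flexible variant (a sketch, not
necessarily the approach in the linked comment) is to key shouldVisit off the
set of seed hosts, collected at seeding time, rather than a hardcoded list:

    import java.net.URI;
    import java.util.Collections;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical helper: remembers the host of every seed URL so that
    // shouldVisit can test candidate URLs against the set, thread-safely.
    public class SeedHosts {
        private static final Set<String> HOSTS =
                Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

        public static void register(String seedUrl) throws Exception {
            HOSTS.add(new URI(seedUrl).getHost().toLowerCase());
        }

        public static boolean allows(String url) {
            try {
                return HOSTS.contains(new URI(url).getHost().toLowerCase());
            } catch (Exception e) {
                return false;  // unparsable URLs are rejected
            }
        }
    }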

Original comment by tahs...@trademango.com on 10 Jan 2012 at 7:38

GoogleCodeExporter commented 9 years ago
How can I eliminate the error at "CrawlController controller = new
CrawlController("/data/crawl/root");" in Controller.java in crawler4j?

Original comment by priyanka...@gmail.com on 9 Oct 2012 at 3:21

GoogleCodeExporter commented 9 years ago
Not a bug or feature request

Original comment by avrah...@gmail.com on 11 Aug 2014 at 12:44