mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

Examples crash #169

GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Run examples.basic.BasicCrawlController.
2. Pass in a file path and 3 as the number of threads.

What is the expected output? What do you see instead?
I don't expect it to crash, but it does.

What version of the product are you using?
latest

Please provide any additional information below.

It crashes here, in WebURL.java:

    if (TLDList.contains(domain)) {

log4j:WARN No appenders could be found for logger 
(org.apache.http.impl.conn.tsccm.ThreadSafeClientConnManager).
log4j:WARN Please initialize the log4j system properly.
Exception breakpoint: Reader.java:61, java.lang.NullPointerException, 
Exception in thread "main" java.lang.ExceptionInInitializerError
    at edu.uci.ics.crawler4j.url.WebURL.setURL(WebURL.java:95)
    at edu.uci.ics.crawler4j.crawler.CrawlController.addSeed(CrawlController.java:347)
    at edu.uci.ics.crawler4j.crawler.CrawlController.addSeed(CrawlController.java:303)
    at edu.uci.ics.crawler4j.examples.basic.BasicCrawlController.main(BasicCrawlController.java:108)
Caused by: java.lang.NullPointerException
    at java.io.Reader.<init>(Reader.java:61)
    at java.io.InputStreamReader.<init>(InputStreamReader.java:55)
    at edu.uci.ics.crawler4j.url.TLDList.<clinit>(TLDList.java:19)
    ... 4 more
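
For readers trying to interpret the trace: an ExceptionInInitializerError whose cause is a NullPointerException thrown from Reader.<init> is what Java produces when a static initializer passes a null stream to InputStreamReader, for example because getResourceAsStream could not find the resource on the classpath. The sketch below reproduces that shape in isolation. BrokenList and the resource path are hypothetical stand-ins, not the actual crawler4j source.

```java
import java.io.InputStream;
import java.io.InputStreamReader;

public class InitializerFailureDemo {

    // Hypothetical stand-in for a class that loads a resource in its static initializer.
    static class BrokenList {
        static final InputStreamReader READER;

        static {
            // getResourceAsStream returns null when the resource is not on the classpath.
            InputStream in =
                BrokenList.class.getResourceAsStream("/hypothetical/missing-tld-names.txt");
            // The Reader constructor rejects a null argument with a NullPointerException.
            READER = new InputStreamReader(in);
        }

        static boolean contains(String s) {
            return false;
        }
    }

    private static boolean attempted;
    private static Throwable cause;

    /**
     * Triggers initialization of BrokenList once and returns the cause of the
     * failure (or null if initialization succeeded). The result is cached because
     * a second attempt to use a class that failed to initialize throws
     * NoClassDefFoundError instead.
     */
    static synchronized Throwable triggerInit() {
        if (!attempted) {
            attempted = true;
            try {
                BrokenList.contains("com");
            } catch (ExceptionInInitializerError e) {
                cause = e.getCause();
            }
        }
        return cause;
    }

    public static void main(String[] args) {
        System.out.println("Cause of initializer failure: " + triggerInit());
    }
}
```

If this matches what is happening here, the first thing to check is whether the TLD names file is actually on the runtime classpath.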

Original issue reported on code.google.com by practica...@gmail.com on 16 Aug 2012 at 11:36

GoogleCodeExporter commented 9 years ago
Found three bugs:

1. TLDList does not need to be static.
2. The field needs an initializer:
    private static Set<String> tldSet = new HashSet<String>();
NOT
    private static Set<String> tldSet;

There is a race condition where "contains" can be called before initialisation, 
with tldSet still null.
3. The resource lookup:
        {
            try {
                tldSet = new HashSet<String>();
                URL url = getClass().getResource("/main/resources/tld-names.txt");
                BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));

I changed how the resource is looked up so that it also works correctly when 
debugging. The class loader does not always return the correct path.
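
A minimal sketch of the kind of defensive loading described above, assuming the list is a plain text file with one entry per line. SafeTldList, its load method, and the resource path "/tld-names.txt" are hypothetical names used for illustration; this is not the fix that was eventually committed to crawler4j.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public final class SafeTldList {

    // Hypothetical resource path; the real location depends on how the project is packaged.
    private static final String RESOURCE = "/tld-names.txt";

    // Assigned exactly once during class initialization and never null.
    private static final Set<String> TLD_SET = load(RESOURCE);

    private SafeTldList() {
    }

    /** Reads one entry per line; returns an empty set if the resource is missing. */
    static Set<String> load(String resourcePath) {
        Set<String> result = new HashSet<>();
        InputStream in = SafeTldList.class.getResourceAsStream(resourcePath);
        if (in == null) {
            // Report the missing file instead of passing null to InputStreamReader.
            System.err.println("Resource not found on classpath: " + resourcePath);
            return Collections.unmodifiableSet(result);
        }
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String entry = line.trim();
                if (!entry.isEmpty()) {
                    result.add(entry.toLowerCase(Locale.ROOT));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Collections.unmodifiableSet(result);
    }

    public static boolean contains(String domain) {
        return domain != null && TLD_SET.contains(domain.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println("contains(\"com\"): " + contains("com"));
    }
}
```

Falling back to an empty set is a design choice made for this sketch. Throwing an exception with a message that names the missing resource would be an equally reasonable alternative, since it makes the packaging problem visible immediately.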

Original comment by practica...@gmail.com on 16 Aug 2012 at 12:09

GoogleCodeExporter commented 9 years ago
I have the same error. How do you fix it?

Original comment by AldoCast...@gmail.com on 11 Nov 2012 at 12:02

GoogleCodeExporter commented 9 years ago
I have the same error too. 
Hey practica...@gmail.com, could you explain in a bit more detail how you fixed 
the issue?

Thanks in advance.

Original comment by nikolov....@gmail.com on 5 Dec 2012 at 12:13

GoogleCodeExporter commented 9 years ago
I didn't. I got fed up and wrote my own kit.

It does about 20 million pages a day with a full page parse. The problem is that 
it is so fast that some sites ban it very quickly.

Original comment by practica...@gmail.com on 5 Dec 2012 at 3:03

GoogleCodeExporter commented 9 years ago
I'm getting the same error. You would think the basic example would work out of 
the box. Is there a fix?

Original comment by jason.sc...@gmail.com on 4 Feb 2013 at 10:33

GoogleCodeExporter commented 9 years ago
This issue was closed by revision 3df4ae16409c.

Original comment by ganjisaffar@gmail.com on 3 Mar 2013 at 5:52