PageFetcher.discardContentIfNotConsumed throws a lot of errors every time.

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. Just start a crawl for any site.

What is the expected output? What do you see instead?
Expected: each page is processed as per the WebCrawler process method. Instead, the errors shown below are logged.

What version of the product are you using?
3.0

Please provide any additional information below.
Whenever I start crawling any site, I see a lot of log4j errors being thrown
from PageFetcher:

java.lang.ArrayIndexOutOfBoundsException
        at java.net.SocketInputStream.read(Unknown Source)
        at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:149)
        at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:111)
        at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:264)
        at org.apache.http.impl.conn.LoggingSessionInputBuffer.readLine(LoggingSessionInputBuffer.java:115)
        at org.apache.http.impl.io.ChunkedInputStream.getChunkSize(ChunkedInputStream.java:246)
        at org.apache.http.impl.io.ChunkedInputStream.nextChunk(ChunkedInputStream.java:204)
        at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:132)
        at java.util.zip.CheckedInputStream.read(Unknown Source)
        at java.util.zip.GZIPInputStream.readUByte(Unknown Source)
        at java.util.zip.GZIPInputStream.readUShort(Unknown Source)
        at java.util.zip.GZIPInputStream.readHeader(Unknown Source)
        at java.util.zip.GZIPInputStream.<init>(Unknown Source)
        at java.util.zip.GZIPInputStream.<init>(Unknown Source)
        at edu.uci.ics.crawler4j.fetcher.PageFetcher$GzipDecompressingEntity.getContent(PageFetcher.java:278)
        at org.apache.http.conn.BasicManagedEntity.getContent(BasicManagedEntity.java:88)
        at org.apache.http.util.EntityUtils.consume(EntityUtils.java:65)
        at edu.uci.ics.crawler4j.fetcher.PageFetcher.discardContentIfNotConsumed(PageFetcher.java:243)
        at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:295)
        at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:178)
        at java.lang.Thread.run(Unknown Source)
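
For readers following the trace: the failure surfaces inside the `GZIPInputStream` constructor, which reads the gzip header from the underlying stream as soon as it is created. The sketch below only illustrates that eager header read using the standard library. It is not a reproduction of the crawler4j bug, and the class and method names (`GzipHeaderSketch`, `tryOpen`, `gzip`) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of crawler4j.
final class GzipHeaderSketch {

    // Tries to wrap the bytes in a GZIPInputStream. The constructor reads the
    // gzip header immediately, so bad input fails here, before any read() call.
    static String tryOpen(byte[] body) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(body))) {
            return "ok";
        } catch (IOException e) {
            return e.getClass().getSimpleName();
        }
    }

    // Helper that produces a valid gzip payload for comparison.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(tryOpen(new byte[0]));                 // empty body
        System.out.println(tryOpen("<html></html>".getBytes()));  // body that is not gzip
        System.out.println(tryOpen(gzip("hello".getBytes())));    // valid gzip body
    }
}
```

Because the constructor can throw, any code that wraps a response body this way has to be prepared for an exception at wrap time, even when the caller only wants to discard the content.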

Original issue reported on code.google.com by svu...@gmail.com on 4 Mar 2012 at 4:53

GoogleCodeExporter commented 8 years ago

As mentioned on the homepage, this version is deprecated because of this bug. 
Please upgrade to the latest version.

-Yasser

Original comment by ganjisaffar@gmail.com on 4 Mar 2012 at 5:04

Changed state: Invalid
Added labels: Type-Other
Removed labels: Type-Defect

GoogleCodeExporter commented 8 years ago

Thanks Yasser, great job on this product. I have a suggestion regarding a
few more exception-handling points where you are using
e.printStackTrace(). Could you change them to logger.error() calls?
I think I saw them in PageFetcher and WebCrawler.
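
A minimal sketch of the kind of change suggested here. It uses `java.util.logging` so that it runs without extra dependencies. With log4j, the equivalent call would be `logger.error("message", e)`. The class and method names (`LoggedFetchSketch`, `fetch`, `fetchQuietly`) are hypothetical and do not come from the crawler4j source.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical example class, not part of crawler4j.
final class LoggedFetchSketch {
    private static final Logger logger =
            Logger.getLogger(LoggedFetchSketch.class.getName());

    // Stand-in for a fetch step that can fail (assumption for illustration).
    static void fetch(String url) throws IOException {
        if (url == null || url.isEmpty()) {
            throw new IOException("empty URL");
        }
    }

    // Returns true on success and false on failure.
    static boolean fetchQuietly(String url) {
        try {
            fetch(url);
            return true;
        } catch (IOException e) {
            // Instead of e.printStackTrace(): route the exception through the
            // logger so it obeys the configured level, format and appenders.
            logger.log(Level.SEVERE, "Failed to fetch " + url, e);
            return false;
        }
    }

    public static void main(String[] args) {
        fetchQuietly("http://example.com/");
        fetchQuietly("");
    }
}
```

Passing the exception as a separate argument keeps the full stack trace in the log output, while letting the logging configuration decide where it goes.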

Original comment by svu...@gmail.com on 4 Mar 2012 at 4:39

PageFetcher.discardContentIfNotConsumed throws a lot of errors every time. #132