mohankreddy / crawler4j

Automatically exported from code.google.com/p/crawler4j

Block when skipping large file #192

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
    1. Set the seed to http://hobbes.ics.uci.edu/downloads/indexes/fly_dm3_index_bitvec16 (a very large file, about 1.2 GB).
    2. Start the crawler.
    3. Observe the message: Skipping a page which was bigger than max allowed size: http://hobbes.ics.uci.edu/downloads/indexes/fly_dm3_index_bitvec16

What is the expected output? What do you see instead?
The crawler should skip the file almost immediately.
Instead, it takes an extremely long time to skip it.

What version of the product are you using?
3.3

Please provide any additional information below.
Root Cause:
    When skipping a large file, the method discardIfNotConsumed is called to discard the remaining bytes in the stream. This takes a very long time because it reads through the entire 1.2 GB response before the stream is closed (see the sketch below).
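To make the slow path concrete, here is a rough illustration of what consuming the remaining body amounts to (this is a sketch of the pattern, not the actual discardIfNotConsumed implementation): the stream is read to end-of-stream so the connection can be reused, which for a 1.2 GB response means downloading the whole file only to discard it.

```java
import java.io.IOException;
import java.io.InputStream;

public class DrainExample {
    // Roughly what "consuming" a response body means: read to end-of-stream
    // so the underlying connection can be reused. For a 1.2 GB body this
    // downloads every remaining byte just to throw it away.
    static void drainAndClose(InputStream content) throws IOException {
        try {
            byte[] buffer = new byte[8192];
            while (content.read(buffer) != -1) {
                // discard
            }
        } finally {
            content.close();
        }
    }
}
```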

Solution:
    When the fetched page is too large, call get.abort() to drop the input stream instead of calling close(), which consumes the remaining bytes.
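A minimal sketch of the proposed fix, assuming Apache HttpClient 4.x (the class name and size limit below are hypothetical, not the actual crawler4j patch): if the declared Content-Length exceeds the configured maximum, abort the request so the connection is dropped immediately rather than draining the body.

```java
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.util.EntityUtils;

public class SizeLimitedFetcher {

    // Hypothetical limit for illustration only.
    private static final long MAX_DOWNLOAD_SIZE = 1048576; // 1 MB

    public static void fetch(HttpClient httpClient, String url) throws Exception {
        HttpGet get = new HttpGet(url);
        try {
            HttpResponse response = httpClient.execute(get);
            HttpEntity entity = response.getEntity();
            if (entity != null && entity.getContentLength() > MAX_DOWNLOAD_SIZE) {
                // abort() closes the underlying connection right away, so the
                // remaining bytes of the oversized body are never downloaded.
                get.abort();
                System.out.println("Skipping a page which was bigger than max allowed size: " + url);
                return;
            }
            // Normal-sized responses: fully consume the entity so the
            // connection can be returned to the pool for reuse.
            if (entity != null) {
                EntityUtils.consume(entity);
            }
        } catch (Exception e) {
            get.abort();
            throw e;
        }
    }
}
```

The trade-off is that an aborted request's connection cannot be reused, but for a multi-gigabyte response that is far cheaper than reading the rest of the stream.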

Original issue reported on code.google.com by iloves...@gmail.com on 6 Feb 2013 at 8:09

GoogleCodeExporter commented 9 years ago
Thanks for your patch; I integrated it into the
http://code.google.com/r/acrocrawler-crawler4j/ clone.

Original comment by acrocraw...@gmail.com on 22 Feb 2013 at 1:20

GoogleCodeExporter commented 9 years ago
This is fixed in this changelist: 
https://code.google.com/p/crawler4j/source/detail?r=3615d8f58ea3072b85007611a547cf21bccbfb18

Thanks,
Yasser

Original comment by ganjisaffar@gmail.com on 2 Mar 2013 at 10:15