internetarchive / jbs

Builds Lucene/Solr indexes out of NutchWAX segments and revisit records via Hadoop.
Apache License 2.0

Parsing stops with IOException #2

Open mikemccabe opened 11 years ago

mikemccabe commented 11 years ago

http://www-tracey.us.archive.org/log_show.php?task_id=160072750

[ PDT: 2013-06-12 12:23:41 ] Executing: JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 PARSE_HOME=/home/mccabe/petabox/sw/lib/parse /home/mccabe/petabox/sw/lib/parse/bin/parse.sh 1024 /var/tmp/autoclean/derive-SURV-20130530103517-crawl452-Parse/ '/var/tmp/autoclean/derive/SURV-20130530103517-crawl452/SURV-20130531055121-00016.warc.gz' '/var/tmp/autoclean/derive/SURV-20130530103517-crawl452/SURV-20130531055121-00016_parsed.json.gz'

13/06/12 19:43:49 ERROR tika.TikaParser: Error parsing http://falkdalen.com/
java.io.IOException: Cannot remove block[ 234881024 ]; out of range[ 0 - 39 ]
    at org.apache.poi.poifs.storage.BlockListImpl.remove(BlockListImpl.java:98)
    at org.apache.poi.poifs.storage.SmallDocumentBlockList.remove(SmallDocumentBlockList.java:30)
    at org.apache.poi.poifs.storage.BlockAllocationTableReader.fetchBlocks(BlockAllocationTableReader.java:191)
    at org.apache.poi.poifs.storage.BlockListImpl.fetchBlocks(BlockListImpl.java:123)
    at org.apache.poi.poifs.storage.SmallDocumentBlockList.fetchBlocks(SmallDocumentBlockList.java:30)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.processProperties(POIFSFileSystem.java:534)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:176)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:74)
    at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
    at org.archive.jbs.Parse$ParseMapper.write(Parse.java:273)
    at org.archive.jbs.Parse$ParseMapper.parseRecord(Parse.java:247)
    at org.archive.jbs.Parse$ParseMapper.map(Parse.java:118)
    at org.archive.jbs.Parse$ParseMapper.map(Parse.java:71)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)

aaronbinns commented 11 years ago

That "error" message is a red herring. As I explained to Hank, the low-level parsers write directly to sterr and therefore their "error" messages about not being able to parse a record cannot be prevented from displaying in the log. It is quite common for a parser to fail to parse a record, especially for MS Office documents, as there are so many bogus MS Office documents out there. So, seeing these kind of low-level parse error messages in the log is not uncommon.

However, I grabbed one of the WARC files from that item's log and it looks like the WARC file has an invalid gzip record:

http://archive.org/download/SURV-20130530103517-crawl452/SURV-20130530171024-00011.warc.gz

Running 'gzip -tv' on that warc.gz file, I get:

gzip: SURV-20130530171024-00011.warc.gz: invalid compressed data--crc error
gzip: SURV-20130530171024-00011.warc.gz: invalid compressed data--length error

Unfortunately, the error message the Parse command emits about the bad gzip record is not captured in the item log.

So, yes, there is an error causing the overall parse to fail: the WARC file has a bad gzip record. But the low-level parse error message makes it look like the Tika parse failure is the cause. The low-level parse error is not the problem; the invalid gzip record is.

You might consider adding a 'gzip -tv' step to the deriver to flag these bad-gzip (W)ARC files and then skip parsing them (for now).
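A minimal sketch of that pre-check, assuming the deriver is driven by a shell script; the variable name and skip message are placeholders, not the actual deriver code:

# hypothetical deriver snippet: test the gzip envelope before parsing,
# and skip (for now) any file that fails the test
if gzip -tv ${file}; then
  <normal parse command on ${file}>
else
  echo "SKIP: ${file} has an invalid gzip record" >&2
fi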

Another option is to talk to Noah about the 'gzip-chunks' utility he wrote, which can take a (W)ARC file with a bad gzip envelope and remove it. You could change the deriver logic to something like:

if ! gzip -tv ${file}; then
  gzip-chunks ${file} > fixed/${file}
  mv fixed/${file} ${file}
fi
<normal parse command on ${file}>
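A slightly more defensive version of the same idea, assuming gzip-chunks exits non-zero when it cannot repair the file and writes the repaired copy to stdout (I haven't verified either assumption against Noah's utility):

if ! gzip -tv ${file}; then
  # repair into a temp copy so a failed gzip-chunks run does not clobber the original
  if gzip-chunks ${file} > fixed/${file}; then
    mv fixed/${file} ${file}
  else
    echo "SKIP: gzip-chunks could not repair ${file}" >&2
    continue   # or exit, depending on how the deriver iterates over files
  fi
fi
<normal parse command on ${file}>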