mikemccabe opened 11 years ago
That "error" message is a red herring. As I explained to Hank, the low-level parsers write directly to stderr, so their "error" messages about not being able to parse a record cannot be kept out of the log. It is quite common for a parser to fail to parse a record, especially an MS Office document, since there are so many bogus MS Office documents out there. Seeing these kinds of low-level parse-error messages in the log is not unusual.
However, I grabbed one of the WARC files from that item's log and it looks like the WARC file has an invalid gzip record:
http://archive.org/download/SURV-20130530103517-crawl452/SURV-20130530171024-00011.warc.gz
Then using 'gzip -tv' on the warc.gz file, I get
gzip: SURV-20130530171024-00011.warc.gz: invalid compressed data--crc error
gzip: SURV-20130530171024-00011.warc.gz: invalid compressed data--length error
And unfortunately, the error message that the Parse command emits reporting the bad gzip record is not captured in the item log.
So, yes, there is an error causing the overall parse to fail: the WARC file has a bad gzip record. But the low-level parse-error message makes it appear that the parse error itself is what causes the overall parse to fail. The low-level parse error is not the problem; the invalid gzip record is.
You might consider adding a 'gzip -tv' step to the deriver to flag these bad-gzip (W)ARC files, and then skip the parsing of them (for now).
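A minimal sketch of that pre-check step (the helper name `check_warc_gzip` is just an illustration, not an existing deriver function):

```shell
# Hypothetical helper: succeed only if the (W)ARC file's gzip envelope
# passes gzip's built-in CRC/length test; the deriver can then skip
# parsing any file with a bad gzip record.
check_warc_gzip() {
    local file="$1"
    if ! gzip -t "${file}" 2>/dev/null; then
        echo "skipping ${file}: invalid gzip record" >&2
        return 1
    fi
}
```

The deriver loop could then `continue` past any file the check rejects, leaving it for later repair.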
Another option is to talk to Noah about the 'gzip-chunks' utility he wrote, which can take a (W)ARC file with a bad gzip envelope and remove the bad record. You could change the deriver logic to something like
if ! gzip -tv ${file}; then
  gzip-chunks ${file} > fixed/${file}
  mv fixed/${file} ${file}
fi
<normal parse command on ${file}>
http://www-tracey.us.archive.org/log_show.php?task_id=160072750
[ PDT: 2013-06-12 12:23:41 ] Executing: JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 PARSE_HOME=/home/mccabe/petabox/sw/lib/parse /home/mccabe/petabox/sw/lib/parse/bin/parse.sh 1024 /var/tmp/autoclean/derive-SURV-20130530103517-crawl452-Parse/ '/var/tmp/autoclean/derive/SURV-20130530103517-crawl452/SURV-20130531055121-00016.warc.gz' '/var/tmp/autoclean/derive/SURV-20130530103517-crawl452/SURV-20130531055121-00016_parsed.json.gz'
13/06/12 19:43:49 ERROR tika.TikaParser: Error parsing http://falkdalen.com/
java.io.IOException: Cannot remove block[ 234881024 ]; out of range[ 0 - 39 ]
    at org.apache.poi.poifs.storage.BlockListImpl.remove(BlockListImpl.java:98)
    at org.apache.poi.poifs.storage.SmallDocumentBlockList.remove(SmallDocumentBlockList.java:30)
    at org.apache.poi.poifs.storage.BlockAllocationTableReader.fetchBlocks(BlockAllocationTableReader.java:191)
    at org.apache.poi.poifs.storage.BlockListImpl.fetchBlocks(BlockListImpl.java:123)
    at org.apache.poi.poifs.storage.SmallDocumentBlockList.fetchBlocks(SmallDocumentBlockList.java:30)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.processProperties(POIFSFileSystem.java:534)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:176)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:74)
    at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
    at org.archive.jbs.Parse$ParseMapper.write(Parse.java:273)
    at org.archive.jbs.Parse$ParseMapper.parseRecord(Parse.java:247)
    at org.archive.jbs.Parse$ParseMapper.map(Parse.java:118)
    at org.archive.jbs.Parse$ParseMapper.map(Parse.java:71)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)