internetarchive / jbs

Builds Lucene/Solr indexes out of NutchWAX segments and revisit records via Hadoop.
Apache License 2.0

Parsing stops with 'IllegalStateException' #1

Open mikemccabe opened 11 years ago

mikemccabe commented 11 years ago

http://www-mccabe.us.archive.org/log_show.php?task_id=160072727

[ PDT: 2013-06-12 11:34:08 ] Executing: JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64 PARSE_HOME=/home/mccabe/petabox/sw/lib/parse /home/mccabe/petabox/sw/lib/parse/bin/parse.sh 1024 /var/tmp/autoclean/derive-SURV-20130604153020-crawl451-Parse/ '/var/tmp/autoclean/derive/SURV-20130604153020-crawl451/SURV-20130605195247-00053.warc.gz' '/var/tmp/autoclean/derive/SURV-20130604153020-crawl451/SURV-20130605195247-00053_parsed.json.gz'

13/06/12 18:50:27 ERROR tika.TikaParser: Error parsing http://jhdb.net/robots.txt
java.lang.IllegalStateException: Table Stream '0Table' wasn't found - Either the document is corrupt, or is Word95 (or earlier)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:201)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:157)
    at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:62)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:95)
    at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
    at org.archive.jbs.Parse$ParseMapper.write(Parse.java:273)
    at org.archive.jbs.Parse$ParseMapper.parseRecord(Parse.java:247)
    at org.archive.jbs.Parse$ParseMapper.map(Parse.java:118)
    at org.archive.jbs.Parse$ParseMapper.map(Parse.java:71)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
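
For discussion, here is a minimal sketch of one common way to keep a batch parse running when a single record makes the parser throw an unchecked exception such as the `IllegalStateException` above: catch it per record and carry on. This is not the jbs code and not a confirmed fix. `SkipBadRecords`, `Result`, and `parseSafely` are hypothetical names, and the `Function<byte[], String>` parser is a stand-in for the real Nutch/Tika call.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

public class SkipBadRecords {

    /** Outcome of one parse attempt: either extracted text or an error description. */
    public static final class Result {
        public final String url;
        public final String text;   // null when parsing failed
        public final String error;  // null when parsing succeeded

        Result(String url, String text, String error) {
            this.url = url;
            this.text = text;
            this.error = error;
        }
    }

    /**
     * Runs the (hypothetical) parser on one record and converts any unchecked
     * exception into a failed Result instead of letting it propagate.
     */
    public static Result parseSafely(String url, byte[] content,
                                     Function<byte[], String> parser) {
        try {
            return new Result(url, parser.apply(content), null);
        } catch (RuntimeException e) {
            // IllegalStateException is a RuntimeException, so it is caught here.
            return new Result(url, null, e.toString());
        }
    }

    public static void main(String[] args) {
        // Stand-in parser: throws on empty input to imitate a corrupt document.
        Function<byte[], String> parser = bytes -> {
            if (bytes.length == 0) {
                throw new IllegalStateException("Table Stream '0Table' wasn't found");
            }
            return new String(bytes, StandardCharsets.UTF_8);
        };

        byte[][] records = {
            "User-agent: *".getBytes(StandardCharsets.UTF_8),
            new byte[0],
            "Disallow: /tmp".getBytes(StandardCharsets.UTF_8)
        };

        for (int i = 0; i < records.length; i++) {
            Result r = parseSafely("record-" + i, records[i], parser);
            if (r.error != null) {
                System.err.println("Skipping " + r.url + ": " + r.error);
            } else {
                System.out.println(r.url + " -> " + r.text);
            }
        }
    }
}
```

Whether skipping the record is the right behavior here, as opposed to falling back to a plain-text parser for `robots.txt`, is a separate design question that the log alone does not answer.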