lintool/warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

java.util.zip.ZipException: invalid distance code #244

Closed by ianmilligan1 7 years ago

ianmilligan1 commented 7 years ago

We've been trying to work with an Archive-It collection that keeps throwing this error. We've tried to isolate the problematic ARC or WARC file, but to no avail. We are running on a single node with the most recent build of Warcbase.

Any clue what might be happening?

Script we are running:

import org.warcbase.spark.matchbox._ 
import org.warcbase.spark.rdd.RecordRDD._ 
import org.warcbase.spark.matchbox.{RemoveHTML, RecordLoader, ExtractBoilerpipeText}

val labour = 
  RecordLoader.loadArchives("/data/TORONTO_canadian_labour_unions/*.gz", sc) 
  .keepValidPages() 
  .map(r => (r.getCrawlMonth, ExtractDomain(r.getUrl))) 
  .countItems() 
  .saveAsTextFile("/data/derivatives/urls/TORONTO_canadian_labour_unions")

Snippet of error message below:

2016-09-20 18:40:21,164 [Executor task launch worker-62] ERROR Executor - Exception in task 4367.0 in stage 0.0 (TID 4367)
java.util.zip.ZipException: invalid distance code
    at org.archive.util.zip.OpenJDK7InflaterInputStream.read(OpenJDK7InflaterInputStream.java:168)
    at org.archive.util.zip.OpenJDK7GZIPInputStream.read(OpenJDK7GZIPInputStream.java:122)
    at org.archive.util.zip.GZIPMembersInputStream.read(GZIPMembersInputStream.java:113)
    at org.archive.io.ArchiveRecord.read(ArchiveRecord.java:204)
    at java.io.InputStream.read(InputStream.java:101)
    at org.warcbase.data.WarcRecordUtils.copyStream(WarcRecordUtils.java:133)
    at org.warcbase.data.WarcRecordUtils.getContent(WarcRecordUtils.java:103)
    at org.warcbase.spark.archive.io.GenericArchiveRecord.<init>(GenericArchiveRecord.scala:48)
    at org.warcbase.spark.matchbox.RecordLoader$$anonfun$loadArchives$2.apply(RecordLoader.scala:45)
    at org.warcbase.spark.matchbox.RecordLoader$$anonfun$loadArchives$2.apply(RecordLoader.scala:45)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:189)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Full error message can be found here.
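
Since eyeballing the collection didn't turn up the bad file, one way to narrow it down outside of Spark is to decompress every .gz file in the collection and see which one throws. A minimal sketch, assuming the files sit on the local filesystem at the path from the script above (the checkGzip helper is hypothetical, not part of Warcbase):

import java.io.{BufferedInputStream, FileInputStream}
import java.util.zip.{GZIPInputStream, ZipException}
import scala.util.{Failure, Success, Try}

// Hypothetical helper: decompress an entire .gz file. GZIPInputStream reads
// concatenated gzip members, so a corrupt member anywhere in a multi-member
// ARC/WARC gzip should surface as a ZipException here.
def checkGzip(path: String): Try[Unit] = Try {
  val in = new GZIPInputStream(new BufferedInputStream(new FileInputStream(path)))
  try {
    val buf = new Array[Byte](8192)
    while (in.read(buf) != -1) {} // drain the stream
  } finally in.close()
}

new java.io.File("/data/TORONTO_canadian_labour_unions")
  .listFiles()
  .filter(_.getName.endsWith(".gz"))
  .sortBy(_.getName)
  .foreach { f =>
    checkGzip(f.getPath) match {
      case Failure(e: ZipException) => println(s"CORRUPT: ${f.getName} (${e.getMessage})")
      case Failure(e)               => println(s"ERROR:   ${f.getName} (${e.getMessage})")
      case Success(_)               => // decompressed cleanly
    }
  }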

ianmilligan1 commented 7 years ago

I've also tried running on the 50 or so files surrounding the one it broke on, using a variation of:

import org.warcbase.spark.matchbox._ 
import org.warcbase.spark.rdd.RecordRDD._ 
import org.warcbase.spark.matchbox.{RemoveHTML, RecordLoader, ExtractBoilerpipeText}

val labour = 
  RecordLoader.loadArchives("/data/TORONTO_canadian_labour_unions/ARCHIVEIT-288-20090313205717-00187-crawling105.us.archive.org.arc.gz", sc) 
  .keepValidPages() 
  .map(r => (r.getCrawlMonth, ExtractDomain(r.getUrl))) 
  .countItems() 
  .saveAsTextFile("/data/derivatives/urls/TORONTO_canadian_labour_unions_TEST")

And they work. Very vexing!
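
If plain decompression passes but Spark still fails, the same bisection can be driven through Warcbase itself, loading one archive at a time and forcing an action so that any ZipException raised while iterating records surfaces on the driver, attributed to a single path. A sketch for the spark-shell, assuming sc is in scope as in the scripts above and the collection is on the local filesystem:

import scala.util.{Failure, Success, Try}
import org.warcbase.spark.matchbox.RecordLoader
import org.warcbase.spark.rdd.RecordRDD._

// Load each archive individually; count() forces full iteration, so a file
// that only breaks partway through still fails here under its own name.
val paths = new java.io.File("/data/TORONTO_canadian_labour_unions")
  .listFiles()
  .filter(_.getName.endsWith(".gz"))
  .map(_.getPath)
  .sorted

paths.foreach { path =>
  Try(RecordLoader.loadArchives(path, sc).keepValidPages().count()) match {
    case Failure(e) => println(s"FAILED: $path (${e.getMessage})")
    case Success(n) => println(s"OK:     $path ($n valid pages)")
  }
}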

ianmilligan1 commented 7 years ago

Just for my own records as I try to get a manageable set to send to Jimmy.

Have run the following file patterns:

*200903*
*200906*
*200909*
*2007*
*2008*
*2010*
*2011*
*2012*
*2013*
*2014*
*2015*
*2016*

Will keep hunting.
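
The manual sweep above can also be automated, running one small job per date glob so a failing pattern narrows the search to the files it matches. A sketch under the same assumptions as before (patterns copied from the list; the loop itself is hypothetical, not a Warcbase feature):

import scala.util.{Failure, Success, Try}
import org.warcbase.spark.matchbox.RecordLoader
import org.warcbase.spark.rdd.RecordRDD._

// One job per glob; a FAILED line pinpoints which slice of the collection
// contains the corrupt archive.
val patterns = Seq("200903", "200906", "200909", "2007", "2008",
                   "2010", "2011", "2012", "2013", "2014", "2015", "2016")

patterns.foreach { p =>
  val glob = s"/data/TORONTO_canadian_labour_unions/*$p*"
  Try(RecordLoader.loadArchives(glob, sc).keepValidPages().count()) match {
    case Failure(e) => println(s"FAILED: $glob (${e.getMessage})")
    case Success(n) => println(s"OK:     $glob ($n valid pages)")
  }
}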

lintool commented 7 years ago

@ianmilligan1 I believe this has been fixed. Please re-open if you're still having issues...

ianmilligan1 commented 7 years ago

Great, will test!