Closed by ianmilligan1 7 years ago
I've also tried it on the 50 or so files surrounding the one that it broke on, using a variation of:
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
import org.warcbase.spark.matchbox.{RemoveHTML, RecordLoader, ExtractBoilerpipeText}

// Count (crawl month, domain) pairs for one ARC file and write them out as text.
val labour = RecordLoader.loadArchives("/data/TORONTO_canadian_labour_unions/ARCHIVEIT-288-20090313205717-00187-crawling105.us.archive.org.arc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlMonth, ExtractDomain(r.getUrl)))
  .countItems()
  .saveAsTextFile("/data/derivatives/urls/TORONTO_canadian_labour_unions_TEST")
And they work. Very vexing!
Just for my own records, as I try to get a manageable set to send to Jimmy, I have run:
*200903*
*200906*
*200909*
*2007*
*2008*
*2010*
*2011*
*2012*
*2013*
*2014*
*2015*
*2016*
Will keep hunting.
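For anyone trying the same hunt, here is a rough sketch of how the file-by-file isolation could be automated instead of running each glob by hand. It is plain Scala with no Warcbase dependency: `findFailures`, the object name, and the file names are all hypothetical, and the `process` function is a stand-in for whatever per-file job you actually run (for example the `RecordLoader.loadArchives(...)` pipeline above, with a distinct output directory per file).

```scala
import scala.util.{Failure, Try}

object IsolateBadArchive {
  // Runs `process` on each path on its own and returns the paths that threw,
  // paired with the exception message. Note that Try only catches non-fatal
  // exceptions, so something like an OutOfMemoryError would still propagate.
  def findFailures(paths: Seq[String])(process: String => Unit): Seq[(String, String)] =
    paths
      .map(path => path -> Try(process(path)))
      .collect { case (path, Failure(e)) => (path, String.valueOf(e.getMessage)) }

  def main(args: Array[String]): Unit = {
    // Hypothetical file names; the stand-in job simulates a failure on one of them.
    val paths = Seq("a.arc.gz", "b.arc.gz", "c.arc.gz")
    val failures = findFailures(paths) { p =>
      if (p.startsWith("b")) throw new RuntimeException("simulated read error")
    }
    failures.foreach { case (p, msg) => println(s"$p failed: $msg") }
  }
}
```

Because each file gets its own run, one bad ARC or WARC no longer hides behind a whole glob, at the cost of starting one job per file.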
@ianmilligan1 I believe this has been fixed. Please re-open if you're still having issues...
Great, will test!
We've been trying to work with an Archive-It collection that keeps throwing this error. We have been trying to isolate the problematic ARC or WARC file, but to no avail. We are running on a single node with the most recent build of Warcbase.
Any clue what might be happening?
Script we are running:
Snippet of error message below:
Full error message can be found here.