lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

java.lang.NegativeArraySizeException #222

Closed · ianmilligan1 closed this 8 years ago

ianmilligan1 commented 8 years ago

We (@ruebot and I) are running a URL extraction job with the following script:

import org.warcbase.spark.matchbox._ 
import org.warcbase.spark.rdd.RecordRDD._ 

val university_of_alberta_websites = 
  RecordLoader.loadArchives("/data/university_of_alberta_websites/*.gz", sc) 
  .keepValidPages() 
  .map(r => (r.getCrawlMonth, ExtractDomain(r.getUrl))) 
  .countItems() 
  .saveAsTextFile("/data/derivatives/urls/university_of_alberta_websites")

This is running on a Compute Canada VM (Ubuntu).

It fails with the following error (tested twice):

java.lang.NegativeArraySizeException
    at org.warcbase.data.WarcRecordUtils.copyStream(WarcRecordUtils.java:125)
    at org.warcbase.data.WarcRecordUtils.getContent(WarcRecordUtils.java:98)
    at org.warcbase.spark.archive.io.GenericArchiveRecord.<init>(GenericArchiveRecord.scala:48)
    at org.warcbase.spark.matchbox.RecordLoader$$anonfun$loadArchives$2.apply(RecordLoader.scala:45)
    at org.warcbase.spark.matchbox.RecordLoader$$anonfun$loadArchives$2.apply(RecordLoader.scala:45)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:209)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:88)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Full error trace is available at https://gist.github.com/ruebot/25d505d4e530c3b9430135f6c9f140fe#file-gistfile1-txt.

Any clue what's up?

lintool commented 8 years ago

Can you narrow it down to a particular WARC that's causing the issue?

ianmilligan1 commented 8 years ago

Haven't been able to. If you look at the error trace, I've tested the last batch of WARCs that the script ingested and they all work.

i.e. tested on:

ARCHIVEIT-1830-NONE-EWVEGS-20120301170834-00230-crawling211.us.archive.org-6682.warc.gz       
ARCHIVEIT-1830-MONTHLY-PBLTCT-20121012203013-00001-wbgrp-crawl063.us.archive.org-6683.warc.gz 
ARCHIVEIT-1830-SEMIANNUAL-JOB166244-20150727112437633-00015.warc.gz                           
ARCHIVEIT-1830-NONE-TEZIEC-20111016191300-00002-crawling200.us.archive.org-6681.warc.gz       
ARCHIVEIT-1830-NONE-FWPGCP-20111002023208-00154-crawling209.us.archive.org-6681.warc.gz       
ARCHIVEIT-1830-MONTHLY-OZBJIK-20120612205311-00002-crawling200.us.archive.org-6681.warc.gz    
ARCHIVEIT-1830-NONE-UHNVDX-20110930193117-00255-crawling202.us.archive.org-6681.warc.gz       
ARCHIVEIT-1830-MONTHLY-CSRYZP-20120815092532-00017-crawling208.us.archive.org-6681.warc.gz    
ARCHIVEIT-1830-MONTHLY-QRDYRH-20120312205931-00006-crawling113.us.archive.org-6681.warc.gz    
ARCHIVEIT-1830-NONE-NHHENM-20120529210105-00011-crawling212.us.archive.org-6680.warc.gz       
ARCHIVEIT-1830-NONE-UHNVDX-20110930061048-00095-crawling202.us.archive.org-6681.warc.gz       
ARCHIVEIT-1830-NONE-FWPGCP-20111001190411-00049-crawling209.us.archive.org-6681.warc.gz       
ARCHIVEIT-1830-NONE-FWPGCP-20111002110121-00311-crawling209.us.archive.org-6681.warc.gz       
ARCHIVEIT-1830-SEMIANNUAL-JOB166244-20150727082037153-00012.warc.gz                           
ARCHIVEIT-1830-NONE-5176-20140601225449881-00000-wbgrp-crawl052.us.archive.org-6442.warc.gz   
ARCHIVEIT-1830-NONE-FWPGCP-20111002045116-00207-crawling209.us.archive.org-6681.warc.gz       
ARCHIVEIT-1830-QUARTERLY-MWZFCS-20120629155541-00077-crawling212.us.archive.org-6682.warc.gz  
ARCHIVEIT-1830-NONE-FRYBJH-20111008205513-00038-crawling208.us.archive.org-6683.warc.gz       
ARCHIVEIT-1830-NONE-FWPGCP-20111001145505-00017-crawling209.us.archive.org-6681.warc.gz       
ARCHIVEIT-1830-NONE-XCUJHA-20111206043053-00000-crawling206.us.archive.org-6683.warc.gz       
ARCHIVEIT-1830-NONE-UHNVDX-20110930101326-00147-crawling202.us.archive.org-6681.warc.gz       
ARCHIVEIT-1830-NONE-EWVEGS-20120301152756-00203-crawling211.us.archive.org-6682.warc.gz       
ARCHIVEIT-1830-QUARTERLY-MWZFCS-20120628023343-00007-crawling212.us.archive.org-6682.warc.gz  
ARCHIVEIT-1830-NONE-CHONHQ-20111006151942-00058-crawling205.us.archive.org-6681.warc.gz       
ARCHIVEIT-1830-NONE-2540-20140603145904065-00000-wbgrp-crawl104.us.archive.org-6445.warc.gz   
ARCHIVEIT-1830-NONE-YDUKWP-20111130205519-00016-crawling203.us.archive.org-6680.warc.gz       
ARCHIVEIT-1830-NONE-UHNVDX-20110930090309-00132-crawling202.us.archive.org-6681.warc.gz       
ARCHIVEIT-1830-QUARTERLY-MWZFCS-20120629222444-00101-crawling212.us.archive.org-6682.warc.gz  

So either our error logging is fishy, or something's happening in the combination of data?

(have I missed a WARC here, @ruebot?)
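
One way to rule out the combination theory: run the load one archive at a time and log which file throws. A rough sketch, reusing the paths from above (hedged, since this isn't a script we've actually run):

import java.io.File
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

// Bisect by file: any failure now names its archive.
new File("/data/university_of_alberta_websites").listFiles
  .filter(_.getName.endsWith(".gz"))
  .foreach { f =>
    try {
      RecordLoader.loadArchives(f.getPath, sc).keepValidPages().count()
      println(s"OK: ${f.getName}")
    } catch {
      case e: Exception => println(s"FAILED: ${f.getName} (${e.getClass.getSimpleName})")
    }
  }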

ianmilligan1 commented 8 years ago

Just had this happen again on a collection that we had successfully run URL extraction on, but which crashed during link extraction (twice).

[Stage 0:====================>                               (1048 + 16) / 2673]INFO  WacGenericInputFormat - Loading file:/data/idle_no_more/ARCHIVEIT-3490-DAILY-RGLLBX-20130130061306-00006-wbgrp-crawl054.us.archive.org-6680.warc.gz
[Stage 0:====================>                               (1049 + 16) / 2673]ERROR Executor - Exception in task 1036.0 in stage 0.0 (TID 1036)
java.lang.NullPointerException
        at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2$$anonfun$apply$1.apply(<console>:27)
        at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2$$anonfun$apply$1.apply(<console>:27)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:60)
        at scala.collection.mutable.MutableList.foreach(MutableList.scala:30)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
        at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:209)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
INFO  WacGenericInputFormat - Loading file:/data/idle_no_more/ARCHIVEIT-3490-NONE-OUDOSH-20130201045442-00313-wbgrp-crawl057.us.archive.org-6682.warc.gz
WARN  TaskSetManager - Lost task 1036.0 in stage 0.0 (TID 1036, localhost): java.lang.NullPointerException
        at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2$$anonfun$apply$1.apply(<console>:27)
        at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2$$anonfun$apply$1.apply(<console>:27)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:60)
        at scala.collection.mutable.MutableList.foreach(MutableList.scala:30)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
        at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:209)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

ERROR TaskSetManager - Task 1036 in stage 0.0 failed 1 times; aborting job
WARN  TaskSetManager - Lost task 1063.0 in stage 0.0 (TID 1063, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1048.0 in stage 0.0 (TID 1048, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1059.0 in stage 0.0 (TID 1059, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1054.0 in stage 0.0 (TID 1054, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1062.0 in stage 0.0 (TID 1062, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1039.0 in stage 0.0 (TID 1039, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 997.0 in stage 0.0 (TID 997, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1058.0 in stage 0.0 (TID 1058, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1061.0 in stage 0.0 (TID 1061, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1065.0 in stage 0.0 (TID 1065, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1057.0 in stage 0.0 (TID 1057, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1056.0 in stage 0.0 (TID 1056, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1053.0 in stage 0.0 (TID 1053, localhost): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1036 in stage 0.0 failed 1 times, most recent failure: Lost task 1036.0 in stage 0.0 (TID 1036, localhost): java.lang.NullPointerException
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2$$anonfun$apply$1.apply(<console>:27)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2$$anonfun$apply$1.apply(<console>:27)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:60)
        at scala.collection.mutable.MutableList.foreach(MutableList.scala:30)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:209)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1919)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
        at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:264)
        at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:126)
        at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
        at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
        at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
        at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:547)
        at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:548)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
        at org.apache.spark.rdd.RDD.sortBy(RDD.scala:545)
        at org.warcbase.spark.rdd.RecordRDD$CountableRDD.countItems(RecordRDD.scala:40)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:80)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:82)
        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:84)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:86)
        at $iwC$$iwC$$iwC.<init>(<console>:88)
        at $iwC$$iwC.<init>(<console>:90)
        at $iwC.<init>(<console>:92)
        at <init>(<console>:94)
        at .<init>(<console>:98)
        at .<clinit>(<console>)
        at .<init>(<console>:7)
        at .<clinit>(<console>)
        at $print(<console>)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
        at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$pasteCommand(SparkILoop.scala:825)
        at org.apache.spark.repl.SparkILoop$$anonfun$standardCommands$8.apply(SparkILoop.scala:345)
        at org.apache.spark.repl.SparkILoop$$anonfun$standardCommands$8.apply(SparkILoop.scala:345)
        at scala.tools.nsc.interpreter.LoopCommands$LoopCommand$$anonfun$nullary$1.apply(LoopCommands.scala:65)
        at scala.tools.nsc.interpreter.LoopCommands$LoopCommand$$anonfun$nullary$1.apply(LoopCommands.scala:65)
        at scala.tools.nsc.interpreter.LoopCommands$NullaryCmd.apply(LoopCommands.scala:76)
        at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:809)
        at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
        at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
        at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
        at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
        at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
        at org.apache.spark.repl.Main$.main(Main.scala:31)
        at org.apache.spark.repl.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NullPointerException
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2$$anonfun$apply$1.apply(<console>:27)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2$$anonfun$apply$1.apply(<console>:27)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:60)
        at scala.collection.mutable.MutableList.foreach(MutableList.scala:30)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.AbstractTraversable.map(Traversable.scala:105)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:209)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

scala> WARN  TaskSetManager - Lost task 1064.0 in stage 0.0 (TID 1064, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1044.0 in stage 0.0 (TID 1044, localhost): TaskKilled (killed intentionally)
WARN  TaskSetManager - Lost task 1045.0 in stage 0.0 (TID 1045, localhost): TaskKilled (killed intentionally)
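
The NullPointerException above surfaces in the REPL-defined closure (<console>:27) rather than inside warcbase itself, so some record is presumably handing the closure a null field. A defensive filter would at least localize it; a sketch only, on the assumption that getUrl can come back null for some records:

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

// Sketch: drop records with a null URL before link extraction, so one
// malformed record cannot abort the whole stage. (That getUrl can be
// null here is an assumption; the actual null source is unconfirmed.)
val guarded = RecordLoader.loadArchives("/data/idle_no_more/*.gz", sc)
  .keepValidPages()
  .filter(r => r.getUrl != null)
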
jrwiebe commented 8 years ago

I just ran the same script on Rho (/mnt/vol1/data_sets/walk-test/*.gz) and it worked.

ianmilligan1 commented 8 years ago

Aye, works on some collections and not on others. I guess it must be related to funky data, although there's a ton of it within these Archive-It collections.

@ruebot – maybe we should move a funky collection over to rho, so we can make sure it's not the setup on WALK somehow.

ruebot commented 8 years ago

Sure. Tell me what collection to copy over, and I'll make it happen.

ianmilligan1 commented 8 years ago

Why don't we move university_of_alberta_websites over and run a variation of

import org.warcbase.spark.matchbox._ 
import org.warcbase.spark.rdd.RecordRDD._ 

val university_of_alberta_websites = 
  RecordLoader.loadArchives("/data/university_of_alberta_websites/*.gz", sc) 
  .keepValidPages() 
  .map(r => (r.getCrawlMonth, ExtractDomain(r.getUrl))) 
  .countItems() 
  .saveAsTextFile("/data/derivatives/urls/university_of_alberta_websites")

and see if it blows up?

ruebot commented 8 years ago

rsyncing over now.

ruebot commented 8 years ago

Forgot to say it was done. Test directory is /mnt/vol1/data_sets/TEST on rho.

ianmilligan1 commented 8 years ago

👍 @ruebot.

Am running this on rho. We'll see if it's a dataset problem or a WALK problem. Stay tuned!

import org.warcbase.spark.matchbox._ 
import org.warcbase.spark.rdd.RecordRDD._ 

val university_of_alberta_websites = 
  RecordLoader.loadArchives("/mnt/vol1/data_sets/TEST/*.gz", sc) 
  .keepValidPages() 
  .map(r => (r.getCrawlMonth, ExtractDomain(r.getUrl))) 
  .countItems() 
  .saveAsTextFile("/mnt/vol1/derivative_data/walk/university_of_alberta_websites")

ianmilligan1 commented 8 years ago

Curses. Failed again with this error. It's different this time.

At least we know it's not related to the system, but connected to the files. I guess the next step is to try error logging, or isolating a WARC or something. 😦

anjackson commented 8 years ago

I note that this error is thrown here: https://github.com/lintool/warcbase/blob/db5c84770adbec1496c85f1beb5f0936fb751906/src/main/java/org/warcbase/data/WarcRecordUtils.java#L125

The code is minting a byte array, which will choke on large (>2GB) payloads. Firstly, somewhere upstream you are casting recordLength to an int, and because the value can be a long this will sometimes set the sign bit, creating a negative value.
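
Concretely (an illustration only, not warcbase code):

val recordLength: Long = 3L * 1024 * 1024 * 1024   // a 3 GB record
val asInt: Int = recordLength.toInt                // -1073741824: the narrowing cast sets the sign bit
// new Array[Byte](asInt)                          // throws java.lang.NegativeArraySizeException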

But that's not really the point because arrays in Java are limited to 2GB anyway. If you are going to read into a byte array you'll need to truncate the payload (ensuring byte[].length <= Integer.MAX_VALUE). FWIW, in webarchive-discovery I used a streaming interface rather than an in-memory array, which is trickier but significantly reduces memory pressure.
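
For the truncating route, a minimal sketch (a hypothetical helper, not the warcbase API):

import java.io.{ByteArrayOutputStream, InputStream}

// Copy at most Int-range bytes from `in`, so an oversized payload is
// clipped rather than crashing the job with a negative allocation.
def copyTruncated(in: InputStream, recordLength: Long): Array[Byte] = {
  val limit = math.min(recordLength, (Int.MaxValue - 8).toLong).toInt  // leave JVM array headroom
  val out = new ByteArrayOutputStream(math.min(limit, 8192))
  val buf = new Array[Byte](8192)
  var remaining = limit
  var n = in.read(buf, 0, math.min(buf.length, remaining))
  while (n > 0) {
    out.write(buf, 0, n)
    remaining -= n
    n = if (remaining > 0) in.read(buf, 0, math.min(buf.length, remaining)) else -1
  }
  out.toByteArray
}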

ruebot commented 8 years ago

Ohhhh. That makes sense, because the great majority of the WARCs in the dataset are around the ~1GB default. But there's a scattering of ~20GB WARCs.
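
A quick way to check that theory, with the data path assumed: list the archives over 2 GB on disk, the obvious candidates (the real limit applies per record, and gzip can hide records that inflate past it, so this is a heuristic, not a proof):

import java.io.File

// Flag archives whose on-disk size exceeds the maximum Java array length.
new File("/data/university_of_alberta_websites").listFiles
  .filter(f => f.getName.endsWith(".gz") && f.length > Int.MaxValue.toLong)
  .foreach(f => println(f"${f.getName}: ${f.length / (1024.0 * 1024 * 1024)}%.1f GB"))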

lintool commented 8 years ago

Seems to be the same as issue #234, which we're encountering at ArchivesUnleashed hackathon 2.0. Moving the discussion over there.

ianmilligan1 commented 8 years ago

Closing as we move to #234, and opening a new ticket on WALK.