Closed: ianmilligan1 closed this issue 8 years ago.
Can you narrow it down to a particular WARC that's causing the issue?
Haven't been able to. If you look at the error trace, I've tested the last batch of WARCs that the script ingested and they all work.
i.e. tested on:
ARCHIVEIT-1830-NONE-EWVEGS-20120301170834-00230-crawling211.us.archive.org-6682.warc.gz
ARCHIVEIT-1830-MONTHLY-PBLTCT-20121012203013-00001-wbgrp-crawl063.us.archive.org-6683.warc.gz
ARCHIVEIT-1830-SEMIANNUAL-JOB166244-20150727112437633-00015.warc.gz
ARCHIVEIT-1830-NONE-TEZIEC-20111016191300-00002-crawling200.us.archive.org-6681.warc.gz
ARCHIVEIT-1830-NONE-FWPGCP-20111002023208-00154-crawling209.us.archive.org-6681.warc.gz
ARCHIVEIT-1830-MONTHLY-OZBJIK-20120612205311-00002-crawling200.us.archive.org-6681.warc.gz
ARCHIVEIT-1830-NONE-UHNVDX-20110930193117-00255-crawling202.us.archive.org-6681.warc.gz
ARCHIVEIT-1830-MONTHLY-CSRYZP-20120815092532-00017-crawling208.us.archive.org-6681.warc.gz
ARCHIVEIT-1830-MONTHLY-QRDYRH-20120312205931-00006-crawling113.us.archive.org-6681.warc.gz
ARCHIVEIT-1830-NONE-NHHENM-20120529210105-00011-crawling212.us.archive.org-6680.warc.gz
ARCHIVEIT-1830-NONE-UHNVDX-20110930061048-00095-crawling202.us.archive.org-6681.warc.gz
ARCHIVEIT-1830-NONE-FWPGCP-20111001190411-00049-crawling209.us.archive.org-6681.warc.gz
ARCHIVEIT-1830-NONE-FWPGCP-20111002110121-00311-crawling209.us.archive.org-6681.warc.gz
ARCHIVEIT-1830-SEMIANNUAL-JOB166244-20150727082037153-00012.warc.gz
ARCHIVEIT-1830-NONE-5176-20140601225449881-00000-wbgrp-crawl052.us.archive.org-6442.warc.gz
ARCHIVEIT-1830-NONE-FWPGCP-20111002045116-00207-crawling209.us.archive.org-6681.warc.gz
ARCHIVEIT-1830-QUARTERLY-MWZFCS-20120629155541-00077-crawling212.us.archive.org-6682.warc.gz
ARCHIVEIT-1830-NONE-FRYBJH-20111008205513-00038-crawling208.us.archive.org-6683.warc.gz
ARCHIVEIT-1830-NONE-FWPGCP-20111001145505-00017-crawling209.us.archive.org-6681.warc.gz
ARCHIVEIT-1830-NONE-XCUJHA-20111206043053-00000-crawling206.us.archive.org-6683.warc.gz
ARCHIVEIT-1830-NONE-UHNVDX-20110930101326-00147-crawling202.us.archive.org-6681.warc.gz
ARCHIVEIT-1830-NONE-EWVEGS-20120301152756-00203-crawling211.us.archive.org-6682.warc.gz
ARCHIVEIT-1830-QUARTERLY-MWZFCS-20120628023343-00007-crawling212.us.archive.org-6682.warc.gz
ARCHIVEIT-1830-NONE-CHONHQ-20111006151942-00058-crawling205.us.archive.org-6681.warc.gz
ARCHIVEIT-1830-NONE-2540-20140603145904065-00000-wbgrp-crawl104.us.archive.org-6445.warc.gz
ARCHIVEIT-1830-NONE-YDUKWP-20111130205519-00016-crawling203.us.archive.org-6680.warc.gz
ARCHIVEIT-1830-NONE-UHNVDX-20110930090309-00132-crawling202.us.archive.org-6681.warc.gz
ARCHIVEIT-1830-QUARTERLY-MWZFCS-20120629222444-00101-crawling212.us.archive.org-6682.warc.gz
So either our error logging is fishy, or something's happening in the combination of data?
(have I missed a WARC here, @ruebot?)
Just had this happen again on a collection that we had successfully run URL extraction on, but crashed during link extraction (twice).
[Stage 0:====================> (1048 + 16) / 2673]INFO WacGenericInputFormat - Loading file:/data/idle_no_more/ARCHIVEIT-3490-DAILY-RGLLBX-20130130061306-00006-wbgrp-crawl054.us.archive.org-6680.warc.gz
[Stage 0:====================> (1049 + 16) / 2673]ERROR Executor - Exception in task 1036.0 in stage 0.0 (TID 1036)
java.lang.NullPointerException
at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2$$anonfun$apply$1.apply(<console>:27)
at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2$$anonfun$apply$1.apply(<console>:27)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:60)
at scala.collection.mutable.MutableList.foreach(MutableList.scala:30)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:209)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
INFO WacGenericInputFormat - Loading file:/data/idle_no_more/ARCHIVEIT-3490-NONE-OUDOSH-20130201045442-00313-wbgrp-crawl057.us.archive.org-6682.warc.gz
WARN TaskSetManager - Lost task 1036.0 in stage 0.0 (TID 1036, localhost): java.lang.NullPointerException
at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2$$anonfun$apply$1.apply(<console>:27)
at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2$$anonfun$apply$1.apply(<console>:27)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:60)
at scala.collection.mutable.MutableList.foreach(MutableList.scala:30)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at $line15.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:209)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
ERROR TaskSetManager - Task 1036 in stage 0.0 failed 1 times; aborting job
WARN TaskSetManager - Lost task 1063.0 in stage 0.0 (TID 1063, localhost): TaskKilled (killed intentionally)
WARN TaskSetManager - Lost task 1048.0 in stage 0.0 (TID 1048, localhost): TaskKilled (killed intentionally)
WARN TaskSetManager - Lost task 1059.0 in stage 0.0 (TID 1059, localhost): TaskKilled (killed intentionally)
WARN TaskSetManager - Lost task 1054.0 in stage 0.0 (TID 1054, localhost): TaskKilled (killed intentionally)
WARN TaskSetManager - Lost task 1062.0 in stage 0.0 (TID 1062, localhost): TaskKilled (killed intentionally)
WARN TaskSetManager - Lost task 1039.0 in stage 0.0 (TID 1039, localhost): TaskKilled (killed intentionally)
WARN TaskSetManager - Lost task 997.0 in stage 0.0 (TID 997, localhost): TaskKilled (killed intentionally)
WARN TaskSetManager - Lost task 1058.0 in stage 0.0 (TID 1058, localhost): TaskKilled (killed intentionally)
WARN TaskSetManager - Lost task 1061.0 in stage 0.0 (TID 1061, localhost): TaskKilled (killed intentionally)
WARN TaskSetManager - Lost task 1065.0 in stage 0.0 (TID 1065, localhost): TaskKilled (killed intentionally)
WARN TaskSetManager - Lost task 1057.0 in stage 0.0 (TID 1057, localhost): TaskKilled (killed intentionally)
WARN TaskSetManager - Lost task 1056.0 in stage 0.0 (TID 1056, localhost): TaskKilled (killed intentionally)
WARN TaskSetManager - Lost task 1053.0 in stage 0.0 (TID 1053, localhost): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1036 in stage 0.0 failed 1 times, most recent failure: Lost task 1036.0 in stage 0.0 (TID 1036, localhost): java.lang.NullPointerException
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2$$anonfun$apply$1.apply(<console>:27)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2$$anonfun$apply$1.apply(<console>:27)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:60)
at scala.collection.mutable.MutableList.foreach(MutableList.scala:30)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:209)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1919)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:264)
at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:126)
at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:547)
at org.apache.spark.rdd.RDD$$anonfun$sortBy$1.apply(RDD.scala:548)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.sortBy(RDD.scala:545)
at org.warcbase.spark.rdd.RecordRDD$CountableRDD.countItems(RecordRDD.scala:40)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:80)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:82)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:84)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:86)
at $iwC$$iwC$$iwC.<init>(<console>:88)
at $iwC$$iwC.<init>(<console>:90)
at $iwC.<init>(<console>:92)
at <init>(<console>:94)
at .<init>(<console>:98)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$pasteCommand(SparkILoop.scala:825)
at org.apache.spark.repl.SparkILoop$$anonfun$standardCommands$8.apply(SparkILoop.scala:345)
at org.apache.spark.repl.SparkILoop$$anonfun$standardCommands$8.apply(SparkILoop.scala:345)
at scala.tools.nsc.interpreter.LoopCommands$LoopCommand$$anonfun$nullary$1.apply(LoopCommands.scala:65)
at scala.tools.nsc.interpreter.LoopCommands$LoopCommand$$anonfun$nullary$1.apply(LoopCommands.scala:65)
at scala.tools.nsc.interpreter.LoopCommands$NullaryCmd.apply(LoopCommands.scala:76)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:809)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NullPointerException
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2$$anonfun$apply$1.apply(<console>:27)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2$$anonfun$apply$1.apply(<console>:27)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:60)
at scala.collection.mutable.MutableList.foreach(MutableList.scala:30)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:27)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:209)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
scala> WARN TaskSetManager - Lost task 1064.0 in stage 0.0 (TID 1064, localhost): TaskKilled (killed intentionally)
WARN TaskSetManager - Lost task 1044.0 in stage 0.0 (TID 1044, localhost): TaskKilled (killed intentionally)
WARN TaskSetManager - Lost task 1045.0 in stage 0.0 (TID 1045, localhost): TaskKilled (killed intentionally)
I just ran the same script on Rho (/mnt/vol1/data_sets/walk-test/*.gz) and it worked.
Aye, works on some collections and not on others. I guess it must be related to funky data, although there's a ton of it within these Archive-It collections.
@ruebot – maybe we should move a funky collection over to rho, so we can make sure it's not the setup on WALK somehow..
Sure. Tell me what collection to copy over, and I'll make it happen.
Why don't we move `university_of_alberta_websites` over, run a variation of
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
val university_of_alberta_websites =
RecordLoader.loadArchives("/data/university_of_alberta_websites/*.gz", sc)
.keepValidPages()
.map(r => (r.getCrawlMonth, ExtractDomain(r.getUrl)))
.countItems()
.saveAsTextFile("/data/derivatives/urls/university_of_alberta_websites")
and see if it blows up?
rsyncing over now.
Forgot to say it was done. Test directory is /mnt/vol1/data_sets/TEST on rho.
👍 @ruebot.
Am running this on rho. We'll see if it's a dataset problem or a WALK problem. Stay tuned!
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
val university_of_alberta_websites =
RecordLoader.loadArchives("/mnt/vol1/data_sets/TEST/*.gz", sc)
.keepValidPages()
.map(r => (r.getCrawlMonth, ExtractDomain(r.getUrl)))
.countItems()
.saveAsTextFile("/mnt/vol1/derivative_data/walk/university_of_alberta_websites")
Curses. Failed again with this error. Different this time.
At least we know it's not related to the system, but connected to the files. I guess the next step is error logging, or isolating a WARC, or something. 😦
I note that this error is thrown here: https://github.com/lintool/warcbase/blob/db5c84770adbec1496c85f1beb5f0936fb751906/src/main/java/org/warcbase/data/WarcRecordUtils.java#L125.
The code is minting a byte array, which will choke on large (>2GB) payloads. Firstly, somewhere upstream you are casting `recordLength` to an `int`, and because the value can be `long` this will set the sign bit sometimes, creating a negative value. But that's not really the point, because arrays in Java are limited to 2GB anyway. If you are going to read into a byte array you'll need to truncate the payload (ensuring `byte[].length <= Integer.MAX_VALUE`). FWIW, in webarchive-discovery I used a streaming interface rather than an in-memory array, which is trickier but significantly reduces memory pressure.
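The narrowing cast described above is easy to reproduce in isolation. This is a minimal sketch, not warcbase's actual code; `recordLength` is just an illustrative variable name:

```java
public class NarrowingCastDemo {
    public static void main(String[] args) {
        // A record length bigger than Integer.MAX_VALUE, e.g. a 3 GiB payload.
        long recordLength = 3L * 1024 * 1024 * 1024;

        // Java's narrowing conversion keeps only the low 32 bits, so the
        // sign bit can end up set and the value goes negative.
        int asInt = (int) recordLength;
        System.out.println(asInt); // prints -1073741824

        // Allocating an array with that value fails immediately:
        // new byte[asInt] -> java.lang.NegativeArraySizeException
    }
}
```

That negative size is why the failure only shows up on collections containing oversized records: WARCs under 2GB convert cleanly, and the bug stays invisible.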
Ohhhh. That makes sense, because the great majority of the WARCs in the dataset are around the ~1GB default. But there's a scattering of ~20GB WARCs.
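A defensive variant of the read would check the length before allocating, so oversized records fail with a clear message instead of a NullPointerException downstream. This is a hedged sketch under assumed names — `readPayload` is a hypothetical helper, not warcbase's actual API, and whether to fail or truncate is a design choice:

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper (not warcbase's actual API): refuse records that
// cannot fit in a Java array instead of letting a narrowing cast go negative.
public class SafePayloadReader {
    static byte[] readPayload(InputStream in, long recordLength) throws IOException {
        if (recordLength < 0 || recordLength > Integer.MAX_VALUE) {
            throw new IOException("Record too large for a byte[]: " + recordLength);
        }
        byte[] buf = new byte[(int) recordLength];
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n < 0) {
                throw new IOException("Unexpected end of stream");
            }
            off += n;
        }
        return buf;
    }
}
```

The streaming approach mentioned above goes further: by processing the payload incrementally through an InputStream, it avoids materializing the record in memory at all, which also sidesteps the 2GB array ceiling.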
Seems to be the same issue as #234, which we're encountering at ArchivesUnleashed hackathon 2.0. Moving the discussion over there.
Closed as moving to #234, and opening a new ticket on WALK.
We (@ruebot and I) are running a URL extraction job with the following script:
On a Compute Canada VM, Ubuntu.
It fails with the following error (tested, twice):
Full error trace is available at https://gist.github.com/ruebot/25d505d4e530c3b9430135f6c9f140fe#file-gistfile1-txt.
Any clue what's up?