hammerlab / guacamole

Spark-based variant calling, with experimental support for multi-sample somatic calling (including RNA) and local assembly
Apache License 2.0

Threw exception: ./scripts/guacamole: line 39: exec: time: not found #572

Open car2008 opened 7 years ago

car2008 commented 7 years ago

Hi, my command is:

./scripts/guacamole somatic-joint /home/zhipeng/soft/guacamole/src/test/resources/synth1.normal.100k-200k.withmd.bam /home/zhipeng/soft/guacamole/src/test/resources/synth1.tumor.100k-200k.withmd.bam --reference-fasta /home/zhipengcheng/file/ucsc.hg19.fasta --out /home/zhipengcheng/file/result/out.vcf

but it threw an exception:

./scripts/guacamole: line 39: exec: time: not found

How do I resolve it?

timodonnell commented 7 years ago

Looks like your shell doesn't have the "time" command. Try just deleting the word "time" from the last line of scripts/guacamole, i.e. instead of:

exec time java -Xmx4g -XX:MaxPermSize=512m "-Dspark.master=local[1]" -cp "$GUACAMOLE_JAR:$GUACAMOLE_DEPS_JAR" org.hammerlab.guacamole.Main "$@"

replace with:

exec java -Xmx4g -XX:MaxPermSize=512m "-Dspark.master=local[1]" -cp "$GUACAMOLE_JAR:$GUACAMOLE_DEPS_JAR" org.hammerlab.guacamole.Main "$@"

We should probably do this permanently so others don't hit this issue.
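One sketch of a permanent fix (hypothetical; the real scripts/guacamole may differ): fall back to running the JVM directly when no external "time" binary is installed. Note that exec does a PATH lookup and so needs a real executable (e.g. /usr/bin/time from the "time" package); the bash keyword doesn't apply there:

# Time the run when an external `time` binary exists; otherwise run directly.
# `type -P` searches PATH for executables only, ignoring the bash keyword.
if [ -n "$(type -P time)" ]; then
  exec time java -Xmx4g -XX:MaxPermSize=512m "-Dspark.master=local[1]" -cp "$GUACAMOLE_JAR:$GUACAMOLE_DEPS_JAR" org.hammerlab.guacamole.Main "$@"
else
  exec java -Xmx4g -XX:MaxPermSize=512m "-Dspark.master=local[1]" -cp "$GUACAMOLE_JAR:$GUACAMOLE_DEPS_JAR" org.hammerlab.guacamole.Main "$@"
fi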

ryan-williams commented 7 years ago

Interesting; I thought "time" was pretty standard in bash, which is what scripts/guacamole's shebang line specifies.

What shell are you using @car2008?

We probably make stronger implicit assumptions elsewhere about the shell environment we're running in, so it might be worth trying to understand what that space looks like as well.
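For reference, a quick way to check what kind of "time" a given session actually has (a sketch; in bash, "time" is typically a reserved word, but exec needs a real executable on the PATH, which would explain "time: not found" on line 39 even under bash):

type -a time    # lists the bash keyword plus any executables on PATH
type -P time    # prints only an executable path; empty if none is installed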

car2008 commented 7 years ago

Hi @timodonnell and @ryan-williams, thank you very much. The shell information I'm using is:

zhipengcheng@192$ cat /etc/shells
/bin/sh
/bin/bash
/sbin/nologin
/usr/bin/sh
/usr/bin/bash
/usr/sbin/nologin
08:32:50 ~/soft1/guacamole
zhipengcheng@192$ bash --version
GNU bash, version 4.2.46(1)-release (x86_64-redhat-linux-gnu)
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

And I deleted the word "time" from the last line of scripts/guacamole as @timodonnell suggested above. It runs now, but throws another exception. The command is:

./scripts/guacamole somatic-joint /home/zhipengcheng/soft1/guacamole/src/test/resources/synth1.normal.100k-200k.withmd.bam /home/zhipengcheng/soft1/guacamole/src/test/resources/synth1.tumor.100k-200k.withmd.bam --reference-fasta /home/zhipengcheng/file/ucsc.hg19.fasta --out /home/zhipengcheng/file/result/out.vcf

and the exception is (is there something wrong with my Spark 1.6.1?):

Found guac jar: target/guacamole-0.0.1-SNAPSHOT.jar
--> [Mon Sep 12 08:01:15 CST 2016]: Guacamole starting.
2016-09-12 08:01:15 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
--> [3.00 sec. later]: Using samtools without BAM index (index unavailable) to read: /home/zhipengcheng/soft1/guacamole/src/test/resources/synth1.normal.100k-200k.withmd.bam
--> [1.00 sec. later]: Using samtools without BAM index (index unavailable) to read: /home/zhipengcheng/soft1/guacamole/src/test/resources/synth1.tumor.100k-200k.withmd.bam
--> [89.83 sec. later]: Partitioning loci
2016-09-12 08:02:53 WARN TaskSetManager:70 - Stage 0 contains a task of very large size (32697 KB). The maximum recommended task size is 100 KB.
--> [13.03 sec. later]: Partitioned loci: 2 partitions. Partition-size stats: num: 2, mean: 100166.5, stddev: 12070.5, mad: 12070.5 elems: 112237, 88096 sorted: 88096, 112237

Contigs-spanned-per-partition stats: num: 2, mean: 1.5, stddev: 0.5, mad: 0.5 elems: 2, 1 sorted: 1, 2
--> [0.02 sec. later]: Partitioning reads into 2 partitions according to loci partitioning: 19:99899-107890=0,19:107942-121113=0,19:121116-146986=0,19:146989-186580=0,19:186582-188055=0,19:188055-200095=1,20:99902-200099=1
2016-09-12 08:03:02 WARN TaskSetManager:70 - Stage 5 contains a task of very large size (32697 KB). The maximum recommended task size is 100 KB.
2016-09-12 08:03:10 ERROR Executor:74 - Managed memory leak detected; size = 174498752 bytes, TID = 10
2016-09-12 08:03:10 ERROR Executor:95 - Exception in task 0.0 in stage 6.0 (TID 10)
org.hammerlab.guacamole.reference.ContigNotFound: Contig 19 does not exist in the current reference. Available contigs are chr11,chr1_gl000191_random,chr5,chr22,chr1_gl000192_random,chrUn_gl000221,chrUn_gl000234,chr18_gl000207_random,chrUn_gl000232,chrUn_gl000224,chr8,chrUn_gl000248,chrUn_gl000243,chrUn_gl000238,chrUn_gl000249,chrUn_gl000228,chr21_gl000210_random,chrUn_gl000217,chr19,chrY,chr9_gl000198_random,chr6_mann_hap4,chr1,chr4_gl000193_random,chr15,chrUn_gl000213,chr17_gl000204_random,chr19_gl000208_random,chrUn_gl000212,chr17_gl000206_random,chrUn_gl000219,chr12,chr18,chr20,chr8_gl000197_random,chr6_cox_hap2,chr2,chrUn_gl000223,chrUn_gl000237,chr9_gl000200_random,chr7_gl000195_random,chr13,chr6_apd_hap1,chrUn_gl000233,chrUn_gl000227,chrUn_gl000222,chrUn_gl000211,chr9_gl000199_random,chrUn_gl000245,chrUn_gl000242,chrUn_gl000216,chr7,chr19_gl000209_random,chrUn_gl000230,chr14,chrUn_gl000215,chrUn_gl000236,chr6_mcf_hap5,chr3,chrUn_gl000226,chr17_gl000203_random,chrM,chr17,chr6_dbb_hap3,chrUn_gl000246,chr4_gl000194_random,chr11_gl000202_random,chr17_ctg5_hap1,chrUn_gl000241,chr6_qbl_hap6,chr4,chrUn_gl000225,chrUn_gl000231,chrUn_gl000240,chr6,chr9,chr9_gl000201_random,chr4_ctg9_hap1,chrUn_gl000235,chrUn_gl000244,chrUn_gl000220,chrUn_gl000214,chr17_gl000205_random,chrX,chrUn_gl000239,chr6_ssto_hap7,chrUn_gl000218,chrUn_gl000229,chr10,chr21,chr16,chrUn_gl000247,chr8_gl000196_random
at org.hammerlab.guacamole.reference.ReferenceBroadcast.getContig(ReferenceBroadcast.scala:18) at org.hammerlab.guacamole.distributed.PileupFlatMapUtils$$anonfun$pileupFlatMapMultipleSamples$1$$anonfun$7.apply(PileupFlatMapUtils.scala:123) at org.hammerlab.guacamole.distributed.PileupFlatMapUtils$$anonfun$pileupFlatMapMultipleSamples$1$$anonfun$7.apply(PileupFlatMapUtils.scala:116) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.hammerlab.guacamole.distributed.PileupFlatMapUtils$$anonfun$pileupFlatMapMultipleSamples$1.apply(PileupFlatMapUtils.scala:116) at org.hammerlab.guacamole.distributed.PileupFlatMapUtils$$anonfun$pileupFlatMapMultipleSamples$1.apply(PileupFlatMapUtils.scala:104) at org.hammerlab.guacamole.distributed.WindowFlatMapUtils$$anonfun$windowFlatMapWithState$1$$anonfun$apply$1.apply(WindowFlatMapUtils.scala:65) at org.hammerlab.guacamole.distributed.WindowFlatMapUtils$$anonfun$windowFlatMapWithState$1$$anonfun$apply$1.apply(WindowFlatMapUtils.scala:55) at org.hammerlab.guacamole.distributed.WindowFlatMapUtils$$anonfun$splitPartitionByContigAndMap$2.apply(WindowFlatMapUtils.scala:141) at org.hammerlab.guacamole.distributed.WindowFlatMapUtils$$anonfun$splitPartitionByContigAndMap$2.apply(WindowFlatMapUtils.scala:131) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:284) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) at org.apache.spark.rdd.RDD.iterator(RDD.scala:268) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)
2016-09-12 08:03:10 WARN TaskSetManager:70 - Lost task 0.0 in stage 6.0 (TID 10, localhost): org.hammerlab.guacamole.reference.ContigNotFound: Contig 19 does not exist in the current reference.

2016-09-12 08:03:10 ERROR TaskSetManager:74 - Task 0 in stage 6.0 failed 1 times; aborting job
2016-09-12 08:03:10 WARN TaskSetManager:70 - Lost task 1.0 in stage 6.0 (TID 11, localhost): TaskKilled (killed intentionally)
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 (TID 10, localhost): org.hammerlab.guacamole.reference.ContigNotFound: Contig 19 does not exist in the current reference.

Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929) at org.apache.spark.rdd.RDD.count(RDD.scala:1157) at org.hammerlab.guacamole.commands.SomaticJoint$Caller$.run(SomaticJointCaller.scala:108) at org.hammerlab.guacamole.commands.SomaticJoint$Caller$.run(SomaticJointCaller.scala:58) at org.hammerlab.guacamole.commands.SparkCommand.run(SparkCommand.scala:12) at org.hammerlab.guacamole.commands.Command.run(Command.scala:27) at org.hammerlab.guacamole.Main$.main(Main.scala:49) at org.hammerlab.guacamole.Main.main(Main.scala)
Caused by: org.hammerlab.guacamole.reference.ContigNotFound: Contig 19 does not exist in the current reference.

arahuja commented 7 years ago

Hi @car2008

The error is actually "ContigNotFound: Contig 19". I believe the files you are using are aligned to b37, where the chromosome names don't have the "chr" prefix, while the file you provided, ucsc.hg19.fasta, has the chromosomes labeled chr1, chr2, ... chr19, etc. Use a b37 fasta file if you have one available, or download the Broad version here: ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Assembly/GRCh37-HG19_Broad_variant/Homo_sapiens_assembly19.fasta
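A quick way to spot this mismatch up front (a sketch, assuming samtools is on the PATH; the paths are the ones from this thread) is to compare the contig names in the BAM header with those in the FASTA index:

# Contig names the reads were aligned to (@SQ lines in the BAM header)
samtools view -H /home/zhipengcheng/soft1/guacamole/src/test/resources/synth1.normal.100k-200k.withmd.bam | grep '^@SQ' | head

# Contig names in the reference (generate the .fai with `samtools faidx` if it is missing)
cut -f1 /home/zhipengcheng/file/ucsc.hg19.fasta.fai | head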

car2008 commented 7 years ago

Thank you @arahuja, it runs well when I use the b37 fasta file.

arahuja commented 7 years ago

That is actually not something we have tested; we typically run on YARN or locally, but we're interested to hear if there are issues.

In that stack trace, at least, Spark recommends different parameters for the number of workers:


SPARK_WORKER_INSTANCES was detected (set to '4').
This is deprecated in Spark 1.0+.

Please instead use:

   - ./spark-submit with --num-executors to specify the number of executors
   - Or set SPARK_EXECUTOR_INSTANCES
   - spark.executor.instances to configure the number of instances in the spark config.

Does one of those solutions resolve the issue?
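For example, a sketch of the spark.executor.instances suggestion applied to the command quoted below (master URL, jars, and paths are the ones from this thread; it may also be worth unsetting SPARK_WORKER_INSTANCES in spark-env.sh):

spark-submit \
  --master spark://12.122.23.2:7070 \
  --conf spark.executor.instances=4 \
  --class org.hammerlab.guacamole.Main \
  --jars target/guacamole-deps-only-0.0.1-SNAPSHOT.jar \
  target/guacamole-0.0.1-SNAPSHOT.jar \
  somatic-joint /home/zhipengcheng/file/tmp/dedup.bam \
  --reference-fasta /home/zhipengcheng/file/ucsc.hg19.fasta \
  --out /home/zhipengcheng/file/result/out.vcf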

On Mon, Sep 12, 2016 at 4:15 AM, ZhipengCheng notifications@github.com wrote:

Hi, I have another question: can guacamole work in Spark standalone mode? I just ran the command

spark-submit --master spark://12.122.23.2:7070 --class org.hammerlab.guacamole.Main --jars target/guacamole-deps-only-0.0.1-SNAPSHOT.jar --verbose target/guacamole-0.0.1-SNAPSHOT.jar somatic-joint /home/zhipengcheng/file/tmp/dedup.bam --reference-fasta /home/zhipengcheng/file/ucsc.hg19.fasta --out /home/zhipengcheng/file/result/out.vcf

but it throws an exception:

Using properties file: /home/zhipengcheng/soft/spark-1.6.1-bin-hadoop2.6/conf/spark-defaults.conf
Adding default property: spark.shuffle.spill.compress=false
Adding default property: spark.shuffle.manager=org.apache.spark.shuffle.sort.SortShuffleManager
Adding default property: spark.shuffle.sort.bypassMergeThreshold=100
Adding default property: spark.shuffle.compress=false
Adding default property: spark.shuffle.io.maxRetries=6
Adding default property: spark.storage.memoryFraction=0.5
Adding default property: spark.shuffle.io.retryWait=15s
Adding default property: spark.shuffle.memoryFraction=0.1
Parsed arguments: master spark://12.122.23.2:7070 deployMode null executorMemory 25g executorCores null totalExecutorCores null propertiesFile /home/zhipengcheng/soft/spark-1.6.1-bin-hadoop2.6/conf/spark-defaults.conf driverMemory null driverCores null driverExtraClassPath null driverExtraLibraryPath null driverExtraJavaOptions null supervise false queue null numExecutors null files null pyFiles null archives null mainClass org.hammerlab.guacamole.Main primaryResource file:/home/zhipengcheng/soft1/guacamole/target/guacamole-0.0.1-SNAPSHOT.jar name org.hammerlab.guacamole.Main childArgs [somatic-joint /home/zhipengcheng/file/tmp/dedup.bam --reference-fasta /home/zhipengcheng/file/ucsc.hg19.fasta --out /home/zhipengcheng/file/result/out.vcf] jars file:/home/zhipengcheng/soft1/guacamole/target/guacamole-deps-only-0.0.1-SNAPSHOT.jar packages null packagesExclusions null repositories null verbose true

Spark properties used, including those specified through --conf and those from the properties file /home/zhipengcheng/soft/spark-1.6.1-bin-hadoop2.6/conf/spark-defaults.conf: spark.shuffle.spill.compress -> false spark.shuffle.manager -> org.apache.spark.shuffle.sort.SortShuffleManager spark.shuffle.sort.bypassMergeThreshold -> 100 spark.shuffle.compress -> false spark.shuffle.io.retryWait -> 15s spark.shuffle.io.maxRetries -> 6 spark.storage.memoryFraction -> 0.5 spark.shuffle.memoryFraction -> 0.1

Main class: org.hammerlab.guacamole.Main Arguments: somatic-joint /home/zhipengcheng/file/tmp/dedup.bam --reference-fasta /home/zhipengcheng/file/ucsc.hg19.fasta --out /home/zhipengcheng/file/result/out.vcf System properties: spark.shuffle.spill.compress -> false spark.shuffle.manager -> org.apache.spark.shuffle.sort.SortShuffleManager spark.shuffle.sort.bypassMergeThreshold -> 100 spark.executor.memory -> 25g spark.shuffle.compress -> false spark.shuffle.io.retryWait -> 15s SPARK_SUBMIT -> true spark.app.name -> org.hammerlab.guacamole.Main spark.shuffle.io.maxRetries -> 6 spark.jars -> file:/home/zhipengcheng/soft1/guacamole/target/guacamole-deps-only-0.0.1-SNAPSHOT.jar,file:/home/zhipengcheng/soft1/guacamole/target/guacamole-0.0.1-SNAPSHOT.jar spark.submit.deployMode -> client spark.master -> spark://12.122.23.2:7070 spark.shuffle.memoryFraction -> 0.1 spark.storage.memoryFraction -> 0.5 Classpath elements: file:/home/zhipengcheng/soft1/guacamole/target/guacamole-0.0.1-SNAPSHOT.jar file:/home/zhipengcheng/soft1/guacamole/target/guacamole-deps-only-0.0.1-SNAPSHOT.jar

--> [Mon Sep 12 15:41:08 CST 2016]: Guacamole starting. 2016-09-12 15:41:08 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2016-09-12 15:41:08 WARN SparkConf:70 - Detected deprecated memory fraction settings: [spark.shuffle.memoryFraction, spark.storage.memoryFraction]. As of Spark 1.6, execution and storage memory management are unified. All memory fractions used in the old model are now deprecated and no longer read. If you wish to use the old memory management, you may explicitly enable spark.memory.useLegacyMode (not recommended). 2016-09-12 15:41:08 WARN SparkConf:70 - SPARK_WORKER_INSTANCES was detected (set to '4'). This is deprecated in Spark 1.0+.

Please instead use:

  • ./spark-submit with --num-executors to specify the number of executors
  • Or set SPARK_EXECUTOR_INSTANCES
  • spark.executor.instances to configure the number of instances in the spark config.

2016-09-12 15:41:11 WARN AppClient$ClientEndpoint:91 - Failed to connect to master 192.168.2.24:7070
java.io.IOException: Failed to connect to /12.122.23.2:7070 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167) at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200) at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187) at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:183) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: /12.122.23.2:7070 at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224) at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) ... 1 more
2016-09-12 15:41:30 WARN AppClient$ClientEndpoint:91 - Failed to connect to master 192.168.2.24:7070
java.io.IOException: Failed to connect to /12.122.23.2:7070
2016-09-12 15:41:50 WARN AppClient$ClientEndpoint:91 - Failed to connect to master 192.168.2.24:7070
java.io.IOException: Failed to connect to /12.122.23.2:7070
2016-09-12 15:42:10 ERROR SparkDeploySchedulerBackend:74 - Application has been killed. Reason: All masters are unresponsive! Giving up.
2016-09-12 15:42:10 WARN SparkDeploySchedulerBackend:70 - Application ID is not initialized yet.
2016-09-12 15:42:10 WARN AppClient$ClientEndpoint:91 - Failed to connect to master 192.168.2.24:7070
java.io.IOException: Failed to connect to /12.122.23.2:7070
2016-09-12 15:42:11 ERROR SparkContext:95 - Error initializing SparkContext.
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext. This stopped SparkContext was created at:

org.apache.spark.SparkContext.&lt;init&gt;(SparkContext.scala:82) org.hammerlab.guacamole.commands.SparkCommand.createSparkContext(SparkCommand.scala:65) org.hammerlab.guacamole.commands.SparkCommand.run(SparkCommand.scala:10) org.hammerlab.guacamole.commands.Command.run(Command.scala:27) org.hammerlab.guacamole.Main$.main(Main.scala:49) org.hammerlab.guacamole.Main.main(Main.scala) sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) java.lang.reflect.Method.invoke(Method.java:606) org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The currently active SparkContext was created at:

(No active SparkContext.)

at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:106) at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1578) at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2179) at org.apache.spark.SparkContext.&lt;init&gt;(SparkContext.scala:579) at org.hammerlab.guacamole.commands.SparkCommand.createSparkContext(SparkCommand.scala:65) at org.hammerlab.guacamole.commands.SparkCommand.run(SparkCommand.scala:10) at org.hammerlab.guacamole.commands.Command.run(Command.scala:27) at org.hammerlab.guacamole.Main$.main(Main.scala:49) at org.hammerlab.guacamole.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Exception in thread "main" java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.

2016-09-12 15:42:11 WARN AppClient$ClientEndpoint:70 - Drop UnregisterApplication(null) because has not yet connected to master
2016-09-12 15:42:11 ERROR MapOutputTrackerMaster:95 - Error communicating with MapOutputTracker
java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1325) at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:110) at org.apache.spark.MapOutputTracker.sendTracker(MapOutputTracker.scala:120) at org.apache.spark.MapOutputTrackerMaster.stop(MapOutputTracker.scala:462) at org.apache.spark.SparkEnv.stop(SparkEnv.scala:93) at org.apache.spark.SparkContext$$anonfun$stop$12.apply$mcV$sp(SparkContext.scala:1756) at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1229) at org.apache.spark.SparkContext.stop(SparkContext.scala:1755) at org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.dead(SparkDeploySchedulerBackend.scala:127) at org.apache.spark.deploy.client.AppClient$ClientEndpoint.markDead(AppClient.scala:264) at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2$$anonfun$run$1.apply$mcV$sp(AppClient.scala:134) at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1163) at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2.run(AppClient.scala:129) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)
2016-09-12 15:42:11 ERROR Utils:95 - Uncaught exception in thread appclient-registration-retry-thread
org.apache.spark.SparkException: Error communicating with MapOutputTracker at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:114) at org.apache.spark.MapOutputTracker.sendTracker(MapOutputTracker.scala:120) at org.apache.spark.MapOutputTrackerMaster.stop(MapOutputTracker.scala:462) at org.apache.spark.SparkEnv.stop(SparkEnv.scala:93) at org.apache.spark.SparkContext$$anonfun$stop$12.apply$mcV$sp(SparkContext.scala:1756) at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1229) at org.apache.spark.SparkContext.stop(SparkContext.scala:1755) at org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.dead(SparkDeploySchedulerBackend.scala:127) at org.apache.spark.deploy.client.AppClient$ClientEndpoint.markDead(AppClient.scala:264) at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2$$anonfun$run$1.apply$mcV$sp(AppClient.scala:134) at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1163) at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2.run(AppClient.scala:129) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException ... 18 more
2016-09-12 15:42:11 ERROR SparkUncaughtExceptionHandler:95 - [Container in shutdown] Uncaught exception in thread Thread[appclient-registration-retry-thread,5,main]
org.apache.spark.SparkException: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up. at org.apache.spark.scheduler.TaskSchedulerImpl.error(TaskSchedulerImpl.scala:438) at org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.dead(SparkDeploySchedulerBackend.scala:124) at org.apache.spark.deploy.client.AppClient$ClientEndpoint.markDead(AppClient.scala:264) at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2$$anonfun$run$1.apply$mcV$sp(AppClient.scala:134) at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1163) at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2.run(AppClient.scala:129) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)

car2008 commented 7 years ago

Hi @arahuja, thank you. I first tried to run the command in Spark local mode: ./scripts/guacamole somatic-joint /home/zhipengcheng/file/tmp/dedup.bam --reference-fasta /home/zhipengcheng/file/ucsc.hg19.fasta --out /home/zhipengcheng/file/result/out.vcf. The memory I'm using is 80g, the cores are 64, and the file dedup.bam is 7g, but it threw an exception:

Found guac jar: target/guacamole-0.0.1-SNAPSHOT.jar
--> [Tue Sep 13 15:59:21 CST 2016]: Guacamole starting.
2016-09-13 15:59:22 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
--> [3.15 sec. later]: Using samtools without BAM index (index unavailable) to read: /home/zhipengcheng/file/tmp/dedup.bam
--> [73.95 sec. later]: Partitioning loci
Exception in thread "qtp420034849-106" java.lang.OutOfMemoryError: GC overhead limit exceeded at org.spark-project.jetty.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:708) at org.spark-project.jetty.io.nio.SelectorManager$1.run(SelectorManager.java:290) at org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:745)
2016-09-13 16:03:10 WARN AbstractConnector:955 - java.lang.OutOfMemoryError: GC overhead limit exceeded at java.io.FileDescriptor.&lt;init&gt;(FileDescriptor.java:62) at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:232) at org.spark-project.jetty.server.nio.SelectChannelConnector.accept(SelectChannelConnector.java:109) at org.spark-project.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:938) at org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:745)
2016-09-13 16:03:10 WARN NettyRpcEndpointRef:91 - Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@76e55122,BlockManagerId(driver, localhost, 36128))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds].
This timeout is controlled by spark.executor.heartbeatInterval at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:449) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:470) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:470) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) ... 
    ... 14 more
2016-09-13 16:03:10 WARN  NettyRpcEnv:70 - Ignored message: HeartbeatResponse(false)
2016-09-13 16:03:16 ERROR ActorSystemImpl:66 - Uncaught fatal error from thread [sparkDriverActorSystem-akka.remote.default-remote-dispatcher-34] shutting down ActorSystem [sparkDriverActorSystem]
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.growArray(ForkJoinPool.java:1090)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1978)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2016-09-13 16:03:16 WARN  nio:498 - java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at org.spark-project.jetty.io.nio.SelectChannelEndPoint.blockWritable(SelectChannelEndPoint.java:494)
    at org.spark-project.jetty.http.AbstractGenerator.blockForOutput(AbstractGenerator.java:517)
    at org.spark-project.jetty.server.HttpOutput.write(HttpOutput.java:170)
    at org.spark-project.jetty.server.HttpOutput.write(HttpOutput.java:107)
    at java.util.zip.GZIPOutputStream.finish(GZIPOutputStream.java:165)
    at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:238)
    at org.spark-project.jetty.http.gzip.AbstractCompressedStream.finish(AbstractCompressedStream.java:191)
    at org.spark-project.jetty.http.gzip.CompressedResponseWrapper.finish(CompressedResponseWrapper.java:347)
    at org.spark-project.jetty.server.handler.GzipHandler.handle(GzipHandler.java:296)
    at org.spark-project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.spark-project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.spark-project.jetty.server.Server.handle(Server.java:370)
    at org.spark-project.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
    at org.spark-project.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
    at org.spark-project.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
    at org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:644)
    at org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
    at org.spark-project.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
    at org.spark-project.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
    at org.spark-project.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
    at org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:745)
2016-09-13 16:03:16 WARN  nio:498 - java.lang.InterruptedException
    at java.lang.Object.wait(Native Method)
    at org.spark-project.jetty.io.nio.SelectChannelEndPoint.blockWritable(SelectChannelEndPoint.java:494)
    at org.spark-project.jetty.http.AbstractGenerator.blockForOutput(AbstractGenerator.java:517)
    at org.spark-project.jetty.server.HttpOutput.write(HttpOutput.java:147)
    at org.spark-project.jetty.server.HttpOutput.write(HttpOutput.java:107)
    at java.util.zip.GZIPOutputStream.finish(GZIPOutputStream.java:165)
    at java.util.zip.DeflaterOutputStream.close(DeflaterOutputStream.java:238)
    at org.spark-project.jetty.http.gzip.AbstractCompressedStream.finish(AbstractCompressedStream.java:191)
    at org.spark-project.jetty.http.gzip.CompressedResponseWrapper.finish(CompressedResponseWrapper.java:347)
    at org.spark-project.jetty.server.handler.GzipHandler.handle(GzipHandler.java:296)
    at org.spark-project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.spark-project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.spark-project.jetty.server.Server.handle(Server.java:370)
    at org.spark-project.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
    at org.spark-project.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
    at org.spark-project.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
    at org.spark-project.jetty.http.HttpParser.parseNext(HttpParser.java:644)
    at org.spark-project.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
    at org.spark-project.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
    at org.spark-project.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
    at org.spark-project.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
    at org.spark-project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:745)
2016-09-13 16:03:16 WARN  QueuedThreadPool:145 - 10 threads could not be stopped
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at htsjdk.samtools.BinaryTagCodec.readTags(BinaryTagCodec.java:282)
    at htsjdk.samtools.BAMRecord.decodeAttributes(BAMRecord.java:308)
    at htsjdk.samtools.BAMRecord.getAttribute(BAMRecord.java:288)
    at htsjdk.samtools.SAMRecord.getReadGroup(SAMRecord.java:691)
    at org.hammerlab.guacamole.reads.Read$.apply(Read.scala:77)
    at org.hammerlab.guacamole.readsets.ReadSets$$anonfun$8.apply(ReadSets.scala:276)
    at org.hammerlab.guacamole.readsets.ReadSets$$anonfun$8.apply(ReadSets.scala:266)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$class.toStream(Iterator.scala:1143)
    at scala.collection.AbstractIterator.toStream(Iterator.scala:1157)
    at scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1143)
    at scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1143)
    at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
    at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
    at scala.collection.immutable.Stream.length(Stream.scala:284)
    at scala.collection.SeqLike$class.size(SeqLike.scala:106)
    at scala.collection.AbstractSeq.size(Seq.scala:40)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:248)
    at scala.collection.AbstractTraversable.toArray(Traversable.scala:105)
    at org.apache.spark.rdd.ParallelCollectionRDD$.slice(ParallelCollectionRDD.scala:154)
    at org.apache.spark.rdd.ParallelCollectionRDD.getPartitions(ParallelCollectionRDD.scala:97)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)

ryan-williams commented 7 years ago

Are you able to share that BAM with us, @car2008?

car2008 commented 7 years ago

Hi @ryan-williams , I'm sorry, I can't share the BAM because it belongs to one of my customers, but I can tell you something about it: it was generated by aligning against hg19.fasta, and you can ask me for any other information you need. Might the BAM have some problems? Is there another public BAM file I can test with in Spark local mode?

arahuja commented 7 years ago

@car2008 Is the BAM very large? In local mode, by default, we use samtools to read in the BAM, which loads it all into memory. This is useful if you are providing the --loci argument to only look at a few positions. Otherwise, if you are not using --loci or the BAM is very large, I would suggest adding the flag --bam-reader-api hadoopbam to use Hadoop-BAM to read the BAM instead, which will be more efficient.
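
For concreteness, a run along those lines might look like the following (the paths and loci range here are hypothetical placeholders, not your actual files):

# Read the BAM via Hadoop-BAM and restrict calling to one region
./scripts/guacamole somatic-joint /path/to/sample.bam \
  --bam-reader-api hadoopbam \
  --reference-fasta /path/to/ucsc.hg19.fasta \
  --loci chr1:100000-200000 \
  --out /path/to/out.vcf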

This issue is being addressed in #569 as well, to simplify this.

ryan-williams commented 7 years ago

He mentioned above that it's a 7GB BAM; since that's compressed you could get arbitrary blow-up when reading the whole thing in, so not terribly surprising that it would blow up in local mode, I guess!

Aside: I'll give #569 a review today @arahuja, sorry for the delay. Does it let us stream over a local BAM in spark-local-mode? Do we not have a way to do that pre-#569?

car2008 commented 7 years ago

Hi @arahuja and @ryan-williams , thank you. Now I want to run the command in local mode: ./scripts/guacamole somatic-joint /mnt/hgfs/D/shiyan/tmp/dedup.bam --bam-reader-api hadoopbam --reference /mnt/hgfs/D/shiyan/ucsc.hg19.fasta --out /mnt/hgfs/D/shiyan/result/out.vcf using the latest release, but it threw an exception:

Using default {local,kryo} Spark config files
spark.serializer          org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator    org.hammerlab.guacamole.kryo.Registrar
spark.master  local[1]
spark.driver.memory 4g
--> [Mon Sep 19 09:40:31 CST 2016]: Guacamole starting.
"--bam-reader-api" is not a valid option
 VAL                                                    : FILE1 FILE2 FILE3
 --analytes STRING[]                                    : [dna|rna] ... [dna|rna]
 --any-allele-min-supporting-percent N                  : min percent of reads to call any allele (somatic or germline) (default: 2.0)
 --any-allele-min-supporting-reads N                    : min reads to call any allele (somatic or germline) (default: 2)
 --compress                                             : Whether to compress the output partitioned reads (default: false). (default: false)
 --debug                                                : If set, prints a higher level of debug output. (default: false)
 --filter-somatic-normal-depth N                        : filter somatic calls with total pooled normal dna depth lower than this value (default: 30)
 --filter-somatic-normal-nonreference-percent N         : filter somatic calls with pooled normal dna percent of reads not matching reference lower
                                                          than this value (default: 3.0)
 --filter-strand-bias-phred N                           : filter calls with strand bias p-value exceeding this phred-scaled value (default: 60.0)
 --force-call-loci VAL                                  : Always call the given sites (default: )
 --force-call-loci-file VAL                             : Always call the given sites (default: )
 --germline-negative-log10-heterozygous-prior N         : prior on a germline het call, higher number means fewer calls (default: 4.0)
 --germline-negative-log10-homozygous-alternate-prior N : prior on a germline hom-alt call (default: 5.0)
 --header-metadata STRING[]                             : Extra header metadata for VCF output in format KEY=VALUE KEY=VALUE ...
 --include-filtered                                     : Include filtered calls (default: false)
 --loci VAL                                             : Loci at which to call variants. Either 'all' or contig:start-end,contig:start-end,...
                                                          (default: )
 --loci-file VAL                                        : Path to file giving loci at which to call variants. (default: )
 --loci-partitioner VAL                                 : Loci partitioner to use: 'exact', 'approximate', or 'uniform' (default: 'exact'). (default:
                                                          exact)
 --loci-partitioning path                               : Load a LociPartitioning from this path if it exists; else write a computed partitioning to
                                                          this path. (default: None)
 --max-alleles-per-site N                               : maximum number of alt alleles to consider at a site (default: 4)
 --max-calls-per-site N                                 : maximum number of calls to make at a site. 0 indicates unlimited (default: 1)
 --max-reads-per-partition N                            : Maximum number of reads to allow any one partition to have. Loci that have more depth than
                                                          this will be dropped. (default: 100000)
 --no-sequence-dictionary                               : If set, get contigs and lengths directly from reads instead of from sequence dictionary.
                                                          (default: false)
 --only-somatic                                         : Output only somatic calls, no germline calls (default: false)
 --out VAL                                              : Output path for all variants in VCF. Default: no output (default: )
 --out-dir VAL                                          : Output dir for all variants, split into separate files for somatic/germline (default: )
 --parallelism N                                        : Number of variant calling partitions. Set to 0 (default) to use the number of Spark
                                                          partitions. (default: 0)
 --partial-reference                                    : Treat the reference fasta as a "partial FASTA", comprised of segments (possibly in the
                                                          interior) of contigs. (default: false)
 --partition-accuracy N                                 : Number of micro-partitions to use, per Spark partition, when partitioning loci (default:
                                                          250). Set to 0 to partition loci uniformly (default: 250)
 --partitioned-reads path                               : Directory from which to read an existing partition-reads RDD, if it exists; otherwise, save
                                                          it here. (default: None)
 --partitioning-dir path                                : Directory from which to read an existing partition-reads RDD (and accompanying
                                                          LociPartitioning), if the directory exists; otherwise, save them here. If set, precludes
                                                          use of --partitioned-reads and --loci-partitioning (default: None)
 --partitioning-stats                                   : Compute additional statistics about the partitioned loci and reads; causes additional Spark
                                                          jobs to be run, so should be disabled in performance-critical environments. Default: false.
                                                          (default: false)
 --reference VAL                                        : Local path to a reference FASTA file
 --sample-names STRING[]                                : name1 ... nameN
 --somatic-genotype-policy VAL                          : how to genotype (e.g. 0/0 vs. 0/1) somatic calls. Valid options: 'presence' (default),
                                                          'trigger'. 'presence' will genotype as a het (0/1) if there is any evidence for that call
                                                          in the sample (num variant reads > 0 and num variant reads > any other non-reference
                                                          allele). 'trigger' will genotype a call as a het only if that sample actually triggered the
                                                          call. (default: presence)
 --somatic-max-germline-error-rate-percent N            : max germline error rate percent to have a somatic call (default: 4.0)
 --somatic-negative-log10-variant-prior N               : prior on somatic call, higher number means fewer calls (default: 6.0)
 --somatic-vaf-floor N                                  : min VAF to use in the likelihood calculation (default: 0.05)
 --tissue-types STRING[]                                : [normal|tumor] ... [normal|tumor]
 -h (-help, --help, -?)                                 : Print help (default: false)
 -print_metrics                                         : Print metrics to the log on completion (default: false)

real    0m0.936s
user    0m0.537s
sys 0m0.245s

Is the flag --bam-reader-api hadoopbam no longer available in the latest release? How can I use Hadoop-BAM to read the BAM?

ryan-williams commented 7 years ago

You got it, --bam-reader-api is now gone, as of #569. Just leave that flag out altogether and you will be using Hadoop-BAM!

car2008 commented 7 years ago

Ok @ryan-williams , got it. I have tried the command ./scripts/guacamole somatic-joint /mnt/hgfs/D/shiyan/tmp/dedup.bam --reference /mnt/hgfs/D/shiyan/reference-human/ucsc.hg19.fasta --out /mnt/hgfs/D/shiyan/result/out.vcf in local mode. It runs well when the BAM file is 97K or 97M, but throws an exception when the BAM grows to 7G. The memory I'm using is 10g and there are 2 CPU cores. Was it because the memory is too small, or some other reason like the network?

Using default {local,kryo} Spark config files
spark.serializer          org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator    org.hammerlab.guacamole.kryo.Registrar
spark.master  local[1]
spark.driver.memory 4g
--> [Mon Sep 19 12:46:38 CST 2016]: Guacamole starting.
2016-09-19 12:46:39 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
--> [167.16 sec. later]: Partitioning loci
2016-09-19 14:05:52 ERROR Executor:74 - Managed memory leak detected; size = 412345478 bytes, TID = 362
2016-09-19 14:05:52 ERROR Executor:95 - Exception in task 138.0 in stage 1.0 (TID 362)
com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 4267
    at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:119)
    at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
    at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:228)
    at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:171)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:152)
    at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:58)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2016-09-19 14:05:52 WARN  TaskSetManager:70 - Lost task 138.0 in stage 1.0 (TID 362, localhost): com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 4267
    at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:119)
    at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
    at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:228)
    at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:171)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:152)
    at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:58)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

2016-09-19 14:05:53 ERROR TaskSetManager:74 - Task 138 in stage 1.0 failed 1 times; aborting job
2016-09-19 14:05:53 WARN  TaskSetManager:70 - Lost task 139.0 in stage 1.0 (TID 363, localhost): TaskKilled (killed intentionally)
2016-09-19 14:05:54 ERROR DiskBlockManager:95 - Exception while deleting local spark dir: /mnt/hgfs/C/spark05-tmp/blockmgr-eedd40c6-6e43-49c3-8458-fd96c459cf45
java.io.IOException: Failed to delete: /mnt/hgfs/C/spark05-tmp/blockmgr-eedd40c6-6e43-49c3-8458-fd96c459cf45
    at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:928)
    at org.apache.spark.storage.DiskBlockManager$$anonfun$org$apache$spark$storage$DiskBlockManager$$doStop$1.apply(DiskBlockManager.scala:174)
    at org.apache.spark.storage.DiskBlockManager$$anonfun$org$apache$spark$storage$DiskBlockManager$$doStop$1.apply(DiskBlockManager.scala:170)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:170)
    at org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:162)
    at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1233)
    at org.apache.spark.SparkEnv.stop(SparkEnv.scala:96)
    at org.apache.spark.SparkContext$$anonfun$stop$12.apply$mcV$sp(SparkContext.scala:1756)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1229)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1755)
    at org.hammerlab.guacamole.commands.SparkCommand.run(SparkCommand.scala:14)
    at org.hammerlab.guacamole.commands.Command.run(Command.scala:27)
    at org.hammerlab.guacamole.Main$.main(Main.scala:49)
    at org.hammerlab.guacamole.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 138 in stage 1.0 failed 1 times, most recent failure: Lost task 138.0 in stage 1.0 (TID 362, localhost): com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 4267
    at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:119)
    at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
    at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:228)
    at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:171)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:152)
    at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:58)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
    at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:264)
    at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:126)
    at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
    at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
    at org.hammerlab.guacamole.readsets.rdd.CoverageRDD$$anonfun$coverage$1.apply(CoverageRDD.scala:54)
    at org.hammerlab.guacamole.readsets.rdd.CoverageRDD$$anonfun$coverage$1.apply(CoverageRDD.scala:54)
    at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189)
    at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
    at org.hammerlab.guacamole.readsets.rdd.CoverageRDD.coverage(CoverageRDD.scala:32)
    at org.hammerlab.guacamole.readsets.rdd.CoverageRDD.makeCappedLociSets(CoverageRDD.scala:67)
    at org.hammerlab.guacamole.loci.partitioning.CappedRegionsPartitioner.partition(CappedRegionsPartitioner.scala:109)
    at org.hammerlab.guacamole.loci.partitioning.LociPartitioning$.apply(LociPartitioning.scala:120)
    at org.hammerlab.guacamole.readsets.rdd.PartitionedRegions$.apply(PartitionedRegions.scala:135)
    at org.hammerlab.guacamole.commands.SomaticJoint$.makeCalls(SomaticJointCaller.scala:166)
    at org.hammerlab.guacamole.commands.SomaticJoint$Caller$.run(SomaticJointCaller.scala:91)
    at org.hammerlab.guacamole.commands.SomaticJoint$Caller$.run(SomaticJointCaller.scala:56)
    at org.hammerlab.guacamole.commands.SparkCommand.run(SparkCommand.scala:12)
    at org.hammerlab.guacamole.commands.Command.run(Command.scala:27)
    at org.hammerlab.guacamole.Main$.main(Main.scala:49)
    at org.hammerlab.guacamole.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 4267
    at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:119)
    at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
    at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:228)
    at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:171)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:152)
    at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:58)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

real    79m18.773s
user    78m51.337s
sys 8m46.174s
ryan-williams commented 7 years ago

It looks to have failed due to a Kryo exception:

Caused by: com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 4267

I'm not sure off the top of my head what would cause that, but I have a couple of ideas.

One possibility is that there is an issue with Java versions, since as of #569 we require Java 8!
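
A quick sanity check on which JVM the script actually picks up (this is just the standard java launcher, nothing Guacamole-specific):

java -version   # should report a 1.8.x runtime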

I don't think it has to do with your memory settings / limits.

I'll try to run some jobs myself and see if I can figure out what may be happening.

ryan-williams commented 7 years ago

What version of Spark are you running on? (i.e. what does $SPARK_HOME point to?)
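
If it's not obvious from that path, Spark can report its version directly:

echo $SPARK_HOME
$SPARK_HOME/bin/spark-submit --version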

The "managed memory leak detected", and logged at Executor:74, is suspicious to me (I don't see any version of Spark where Executor.scala:74 could have logged that message), but that's as far as I've gotten.

I have a couple ideas for printing more debugging info we can try as well, stand by.

car2008 commented 7 years ago

Hi @ryan-williams , thanks for your advice. The Spark version is 1.6.1 and the Java version is "1.8.0_101". I have run the command successfully in Spark 1.6.1 local mode using the 7G BAM, which generated a 74M VCF file, after I changed to another server which has 100g memory and 48 cores: ./scripts/guacamole somatic-joint /home/bioinfo/zhipengcheng/file/dedup.bam --reference /home/bioinfo/zhipengcheng/file/ucsc.hg19.fasta --out /home/bioinfo/zhipengcheng/file/result/out.vcf . The console printed:

Using default {local,kryo} Spark config files
spark.serializer          org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator    org.hammerlab.guacamole.kryo.Registrar
spark.master  local[1]
spark.driver.memory 10g
--> [Wed Sep 21 17:23:14 CST 2016]: Guacamole starting.
2016-09-21 17:23:15 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
--> [57.15 sec. later]: Partitioning loci
--> [9053.80 sec. later]: Partitioned loci: 1049 partitions.
Partition-size stats:
num:    1049,   mean:   789966.1,   stddev: 405879.6,   mad:    272653
elems:  817449, 290329, 1280665, 1821549, 911489, 1014036, 48160, 1231108, 1023372, 1052595, …, 649050, 505449, 1048721, 351333, 639272, 313225, 1052318, 550549, 531763, 1606956
sorted: 4734, 5462, 5728, 6449, 11514, 13501, 14946, 16547, 17107, 19255, …, 1922236, 1924294, 1925612, 2028128, 2035102, 2052936, 2075179, 2153283, 2407485, 2536919
0.1:    5715.2
1:  26171.2
5:  156923.2
10: 313447.8
25: 492129
50: 757261
75: 1035206
90: 1332269.2
95: 1513677.4
99: 1890262.8
99.9:   2395283.3

Contigs-spanned-per-partition stats:
num:    1049,   mean:   1.1,    stddev: 1.1,    mad:    0
elems:  1×24, 3, 1×28, 2, 1×44, 2, 1×53, 2, 1×7, 2, …, 1×54, 4, 1×100, 2, 1×145, 7, 1×3, 3, 2, 1×2
sorted: 1×1019, 2×19, 3×5, 4, 5, 6, 7, 16, 31
0.1:    1
1:  1
5:  1
10: 1
25: 1
50: 1
75: 1
90: 1
95: 1
99: 2.5
99.9:   15.6
--> [84.22 sec. later]: Partitioning reads into 1049 partitions according to loci partitioning:
chr1:10001-10233=0,chr1:10269-10429=0,chr1:10438-10566=0,chr1:11341-11816=0,chr1:11820-13755=0,chr1:13757-13938=0,chr1:13969-14464=0,chr1:14574-14724=0,chr1:14733-15820=0,chr1:15933-16315=0,chr1:16316-16417=0,chr1:16447-16726=0,chr1:16838-17674=0,chr1:17677-18114=0,chr1:18302-18403=0,chr1:18417-18518=0,chr1:18552-18807=0,chr1:18869-19604=0,chr1:19860-20813=0,chr1:21069-21170=0,chr1:21176-21277=0,chr1:21279-21380=0,chr1:21399-21500=0,chr1:21603-21818=0,chr1:22276-22377=0,chr1:22393-22494=0, [...]
2016-09-21 20:27:41 WARN  TaskSetManager:70 - Stage 6 contains a task of very large size (131 KB). The maximum recommended task size is 100 KB.
--> [26131.06 sec. later]: Collecting evidence for 431,499 sites with calls
2016-09-22 03:12:12 WARN  TaskSetManager:70 - Stage 8 contains a task of very large size (131 KB). The maximum recommended task size is 100 KB.
--> [66.15 sec. later]: Called 431,499 germline and 0 somatic variants.
--> [0.01 sec. later]: Writing 431,499 calls across 1 samples  to /home/bioinfo/zhipengcheng/file/result/out.vcf
--> [49.46 sec. later]: Done.

real    590m44.670s
user    1844m27.868s
sys 9m32.235s

Sorry for another question: I want to reduce the processing time in local mode; could you give me some advice? Meanwhile, I have tested the same command using a 400M BAM file on the server I used before (10g memory, 2 cores), but it still threw the same exception. Later I will try to print more debugging information to show:

Using default {local,kryo} Spark config files
spark.serializer          org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator    org.hammerlab.guacamole.kryo.Registrar
spark.master  local[1]
spark.driver.memory 4g
--> [Wed Sep 21 17:32:17 CST 2016]: Guacamole starting.
2016-09-21 17:32:18 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
--> [188.95 sec. later]: Partitioning loci
2016-09-21 18:31:29 ERROR Executor:74 - Managed memory leak detected; size = 1355742798 bytes, TID = 36
2016-09-21 18:31:29 ERROR Executor:95 - Exception in task 0.0 in stage 4.0 (TID 36)
com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 2422353
    at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:119)
    at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
    at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:228)
    at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:171)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:103)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2016-09-21 18:31:29 WARN  TaskSetManager:70 - Lost task 0.0 in stage 4.0 (TID 36, localhost): com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 2422353
    at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:119)
    at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
    at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:228)
    at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:171)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:103)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

2016-09-21 18:31:29 ERROR TaskSetManager:74 - Task 0 in stage 4.0 failed 1 times; aborting job
2016-09-21 18:31:29 WARN  TaskSetManager:70 - Lost task 1.0 in stage 4.0 (TID 37, localhost): TaskKilled (killed intentionally)
2016-09-21 18:31:30 ERROR DiskBlockManager:95 - Exception while deleting local spark dir: /mnt/hgfs/C/spark05-tmp/blockmgr-628ef40b-be2c-4642-9254-526d95166d6e
java.io.IOException: Failed to delete: /mnt/hgfs/C/spark05-tmp/blockmgr-628ef40b-be2c-4642-9254-526d95166d6e
    at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:928)
    at org.apache.spark.storage.DiskBlockManager$$anonfun$org$apache$spark$storage$DiskBlockManager$$doStop$1.apply(DiskBlockManager.scala:174)
    at org.apache.spark.storage.DiskBlockManager$$anonfun$org$apache$spark$storage$DiskBlockManager$$doStop$1.apply(DiskBlockManager.scala:170)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:170)
    at org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:162)
    at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1233)
    at org.apache.spark.SparkEnv.stop(SparkEnv.scala:96)
    at org.apache.spark.SparkContext$$anonfun$stop$12.apply$mcV$sp(SparkContext.scala:1756)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1229)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1755)
    at org.hammerlab.guacamole.commands.SparkCommand.run(SparkCommand.scala:14)
    at org.hammerlab.guacamole.commands.Command.run(Command.scala:27)
    at org.hammerlab.guacamole.Main$.main(Main.scala:49)
    at org.hammerlab.guacamole.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 36, localhost): com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 2422353
    at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:119)
    at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
    at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:228)
    at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:171)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:103)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
    at org.hammerlab.guacamole.loci.partitioning.CappedRegionsPartitioner.partition(CappedRegionsPartitioner.scala:114)
    at org.hammerlab.guacamole.loci.partitioning.LociPartitioning$.apply(LociPartitioning.scala:120)
    at org.hammerlab.guacamole.readsets.rdd.PartitionedRegions$.apply(PartitionedRegions.scala:135)
    at org.hammerlab.guacamole.commands.SomaticJoint$.makeCalls(SomaticJointCaller.scala:166)
    at org.hammerlab.guacamole.commands.SomaticJoint$Caller$.run(SomaticJointCaller.scala:91)
    at org.hammerlab.guacamole.commands.SomaticJoint$Caller$.run(SomaticJointCaller.scala:56)
    at org.hammerlab.guacamole.commands.SparkCommand.run(SparkCommand.scala:12)
    at org.hammerlab.guacamole.commands.Command.run(Command.scala:27)
    at org.hammerlab.guacamole.Main$.main(Main.scala:49)
    at org.hammerlab.guacamole.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 2422353
    at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:119)
    at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
    at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:228)
    at org.apache.spark.serializer.DeserializationStream.readValue(Serializer.scala:171)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:197)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:103)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

real    59m13.288s
user    93m37.721s
sys 6m17.956s
ryan-williams commented 7 years ago

Thanks for that info! Interesting that throwing more memory at it fixes the problem (edit: it looks like you only used more memory, 10g vs. 4g, in the run that passed above; see discussion below); also weird that we'd get that Kryo error if the problem was memory pressure. I'd love to be able to reproduce that locally to dig further; I've tried a few things but haven't seen it locally yet.

One thing that might shed a little more light on the crashing app is to enable INFO logging in Spark, using their default log4j configuration:

mv $SPARK_HOME/conf/log4j.properties{.template,}

It's been really helpful to have all the output here for each case so far, thanks for including that! It might reduce clutter here if you put your output into a gist, especially if you start including all the INFO logs, per above.
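
If the INFO output later proves too chatty, you can dial it back in the copied file; in stock Spark 1.6 templates the relevant line is the root category (worth double-checking in your copy):

# in $SPARK_HOME/conf/log4j.properties
log4j.rootCategory=INFO, console    # change INFO to WARN to quiet it down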

Crashing app

Your kryo crash is occurring above during the coverage-depth-computation stage; the memory used there should be proportional to:

  1. The number of loci covered (up to 3B for a whole genome).
    • This is because one record is output from the "mappers" for each genomic locus seen.
    • Overages here should result in "spilling" to disk, which should impact performance but not the stability of the job as a whole.
      • However, in the past we've observed Spark apps to generally degrade and crash once spilling is occurring.
      • INFO logging, per above, will make it clear whether this is happening.
  2. The maximum depth of reads in the BAM.

In general, the number of reads in / size of the BAM shouldn't affect memory usage, so the bigger BAM should only fail if it also includes particularly high-depth regions. If you can narrow down the extent to which that may be happening / what depths you think you may be dealing with, that may also help us figure out what's going on.
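
One way to get a rough sense of the depths involved (assuming samtools is installed; /path/to/dedup.bam is a placeholder, and this scans the whole BAM, so it can take a while on a 7GB file):

samtools depth /path/to/dedup.bam \
  | awk '$3 > max { max = $3 } END { print "max depth:", max }'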

As an aside, I was able to reproduce the "Managed memory leak" error msg before a crash after taking a working run similar to the command line you gave and reducing the driver memory until it crashed. Notable error msgs:

2016-09-21 18:47:54 ERROR Executor:74 - Managed memory leak detected; size = 96982998 bytes, TID = 132
2016-09-21 18:47:54 ERROR Executor:95 - Exception in task 6.0 in stage 3.0 (TID 132)
java.lang.OutOfMemoryError: GC overhead limit exceeded

So I guess that message, as well as the confusing/nonsensical Executor:74 line-number, seems to be just a piece of an out-of-memory death-rattle. What specifically caused your run to die with Kryo exceptions is still unclear to me, but hopefully we can figure that out.

Optimizing working app

Glad to hear that your run succeeded on a 100GB, 48-core server; what's really interesting about it is that despite the machine having those resources, your run still had default configurations: 1 core, 10GB driver:

spark.master  local[1]
spark.driver.memory 10g

Did the other machine, from the crashing version, actually have 10GB of RAM available?

The shortest answer for making it run faster is to use all the cores on the more powerful machine!

To do that, you need to tell Guacamole to use different spark-config files than conf/local which it uses by default; here's an example:

# Put some new Spark configs in files in the conf/ directory
echo "spark.master    local[*]" > conf/local-multi
echo "spark.driver.memory 16g" > conf/driver-16g

# Set GUAC_SPARK_CONFS to the files you want to use for a given run.
export GUAC_SPARK_CONFS=conf/local-multi,conf/driver-16g

# Run Guacamole!
scripts/guacamole somatic-joint …

My apologies if using the conf directory / $GUAC_SPARK_CONFS this way feels clumsy; I just added that interface in #579 a few days ago, so we may decide to change it. Hopefully the example above gives you a sense of how to set Spark configs to your liking, though.

In particular, I'm interested to see how your runtime changes using local[*] (48 cores) instead of local[1] (1 core)!

ryan-williams commented 7 years ago

Looking back at this, I noticed that, despite the machines being used having 10GB and 100GB of available memory (if I understand you correctly), your apps are running with spark.driver.memory of 4g (fails) and 10g (succeeds), respectively.

That makes reasonable sense, as running over a whole genome, with a .fasta reference, will likely always use more than 4g; I'm doing some benchmarking locally to narrow down that relationship a bit further, but for now I think experimenting with using more cores, and giving the driver more memory, as I sketched above, should help you to not crash and to run much faster. Let me know how that goes!
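
Following the same conf-file pattern as above, that could look like this (the 32g figure is just an illustration; use whatever the machine can spare):

echo "spark.driver.memory 32g" > conf/driver-32g
export GUAC_SPARK_CONFS=conf/local-multi,conf/driver-32g
scripts/guacamole somatic-joint …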

car2008 commented 7 years ago

Hi @ryan-williams , thanks for your advice; I'm testing it as you suggested now. I will show you some results of the testing: the runtime is real 319m6.793s when using driver.memory 10g, local[4], and real 347m35.511s using driver.memory 16g, local[24], respectively. But it threw an "OOM" exception using driver.memory 10g, local[6] and driver.memory 10g, local[8]. It also threw an exception using driver.memory 16g, local[*] (48 cores) (part of the console output):

16/09/24 22:18:11 INFO ExternalAppendOnlyMap: Thread 313 spilling in-memory map of 188.2 MB to disk (2 times so far)
16/09/24 22:18:25 INFO ExternalAppendOnlyMap: Thread 316 spilling in-memory map of 156.8 MB to disk (2 times so far)
16/09/24 22:18:25 INFO ExternalAppendOnlyMap: Thread 198 spilling in-memory map of 195.8 MB to disk (2 times so far)
Exception in thread "qtp178603358-172" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.addConditionWaiter(AbstractQueuedSynchronizer.java:1855)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2068)
    at org.spark-project.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:342)
    at org.spark-project.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:526)
    at org.spark-project.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:44)
    at org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
    at java.lang.Thread.run(Thread.java:745)
Exception in thread "qtp178603358-400" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.addConditionWaiter(AbstractQueuedSynchronizer.java:1855)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2068)
    at org.spark-project.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:342)
    at org.spark-project.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:526)
    at org.spark-project.jetty.util.thread.QueuedThreadPool.access$600(QueuedThreadPool.java:44)
    at org.spark-project.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
    at java.lang.Thread.run(Thread.java:745)
16/09/24 22:25:19 INFO ExternalAppendOnlyMap: Thread 221 spilling in-memory map of 183.0 MB to disk (2 times so far)
Exception in thread "qtp178603358-406" java.lang.OutOfMemoryError: GC overhead limit exceeded
16/09/24 22:25:53 ERROR Executor: Managed memory leak detected; size = 152035235 bytes, TID = 245
16/09/24 22:25:53 ERROR Executor: Exception in task 21.0 in stage 1.0 (TID 245)
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:152)
    at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:58)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
16/09/24 22:26:07 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-34,5,main]
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:152)
    at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:58)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
16/09/24 22:26:07 INFO SparkContext: Invoking stop() from shutdown hook
16/09/24 22:26:24 WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@d5667a9,BlockManagerId(driver, localhost, 44173))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
    at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
    at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:449)
    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:470)
    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
    at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:470)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    ... 14 more
16/09/24 22:26:24 INFO TaskSetManager: Starting task 48.0 in stage 1.0 (TID 272, localhost, partition 48,NODE_LOCAL, 2042 bytes)
16/09/24 22:26:24 WARN TaskSetManager: Lost task 21.0 in stage 1.0 (TID 245, localhost): java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:152)
    at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:58)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

16/09/24 22:26:24 ERROR TaskSetManager: Task 21 in stage 1.0 failed 1 times; aborting job
16/09/24 22:26:24 INFO Executor: Running task 48.0 in stage 1.0 (TID 272)
16/09/24 22:26:24 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
16/09/24 22:26:24 INFO TaskSchedulerImpl: Cancelling stage 1
16/09/24 22:26:24 INFO Executor: Executor is trying to kill task 9.0 in stage 1.0 (TID 233)
16/09/24 22:26:24 INFO TaskSchedulerImpl: Stage 1 was cancelled
16/09/24 22:26:38 INFO Executor: Executor is trying to kill task 39.0 in stage 1.0 (TID 263)
16/09/24 22:26:38 INFO Executor: Executor is trying to kill task 31.0 in stage 1.0 (TID 255)
16/09/24 22:26:38 INFO Executor: Executor is trying to kill task 10.0 in stage 1.0 (TID 234)
16/09/24 22:26:38 INFO DAGScheduler: ResultStage 1 (sortByKey at CoverageRDD.scala:54) failed in 946.467 s
16/09/24 22:26:38 INFO Executor: Executor is trying to kill task 32.0 in stage 1.0 (TID 256)
16/09/24 22:26:38 INFO Executor: Executor is trying to kill task 11.0 in stage 1.0 (TID 235)
16/09/24 22:26:38 INFO Executor: Executor is trying to kill task 3.0 in stage 1.0 (TID 227)
16/09/24 22:26:38 INFO Executor: Executor is trying to kill task 33.0 in stage 1.0 (TID 257)
16/09/24 22:26:38 INFO Executor: Executor is trying to kill task 4.0 in stage 1.0 (TID 228)
16/09/24 22:26:38 INFO Executor: Executor is trying to kill task 34.0 in stage 1.0 (TID 258)
16/09/24 22:26:38 INFO DAGScheduler: Job 0 failed: sortByKey at CoverageRDD.scala:54, took 2324.644653 s
16/09/24 22:26:55 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
16/09/24 22:26:55 WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@d5667a9,BlockManagerId(driver, localhost, 44173))] in 2 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
    at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
    at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:449)
    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:470)
    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
    at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:470)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    ... 14 more
16/09/24 22:26:55 INFO Executor: Executor is trying to kill task 26.0 in stage 1.0 (TID 250)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 5.0 in stage 1.0 (TID 229)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 27.0 in stage 1.0 (TID 251)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 6.0 in stage 1.0 (TID 230)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 28.0 in stage 1.0 (TID 252)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 29.0 in stage 1.0 (TID 253)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 0.0 in stage 1.0 (TID 224)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 22.0 in stage 1.0 (TID 246)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 1.0 in stage 1.0 (TID 225)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 44.0 in stage 1.0 (TID 268)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 23.0 in stage 1.0 (TID 247)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 2.0 in stage 1.0 (TID 226)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 45.0 in stage 1.0 (TID 269)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 24.0 in stage 1.0 (TID 248)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 46.0 in stage 1.0 (TID 270)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 25.0 in stage 1.0 (TID 249)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 17.0 in stage 1.0 (TID 241)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 47.0 in stage 1.0 (TID 271)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 18.0 in stage 1.0 (TID 242)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 48.0 in stage 1.0 (TID 272)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 40.0 in stage 1.0 (TID 264)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 19.0 in stage 1.0 (TID 243)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 41.0 in stage 1.0 (TID 265)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 20.0 in stage 1.0 (TID 244)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 12.0 in stage 1.0 (TID 236)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 42.0 in stage 1.0 (TID 266)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 13.0 in stage 1.0 (TID 237)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 43.0 in stage 1.0 (TID 267)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 35.0 in stage 1.0 (TID 259)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 14.0 in stage 1.0 (TID 238)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 36.0 in stage 1.0 (TID 260)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 15.0 in stage 1.0 (TID 239)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 7.0 in stage 1.0 (TID 231)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 37.0 in stage 1.0 (TID 261)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 16.0 in stage 1.0 (TID 240)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 8.0 in stage 1.0 (TID 232)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 38.0 in stage 1.0 (TID 262)
16/09/24 22:27:08 INFO Executor: Executor is trying to kill task 30.0 in stage 1.0 (TID 254)
16/09/24 22:26:55 WARN QueuedThreadPool: 25 threads could not be stopped
16/09/24 22:27:08 ERROR Executor: Managed memory leak detected; size = 162170912 bytes, TID = 229
16/09/24 22:27:08 ERROR Executor: Exception in task 48.0 in stage 1.0 (TID 272)
org.apache.spark.TaskKilledException
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:204)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
16/09/24 22:27:08 ERROR Executor: Managed memory leak detected; size = 158792352 bytes, TID = 239
16/09/24 22:27:08 ERROR Executor: Managed memory leak detected; size = 158792352 bytes, TID = 267
16/09/24 22:27:08 INFO Executor: Executor killed task 15.0 in stage 1.0 (TID 239)
16/09/24 22:27:08 INFO SparkUI: Stopped Spark web UI at http://192.168.2.25:4040
Exception in thread "dispatcher-event-loop-30" java.lang.OutOfMemoryError: GC overhead limit exceeded
16/09/24 22:27:21 INFO SparkContext: SparkContext already stopped.
16/09/24 22:27:21 INFO Executor: Executor killed task 43.0 in stage 1.0 (TID 267)
16/09/24 22:27:21 INFO Executor: Executor killed task 5.0 in stage 1.0 (TID 229)
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 21 in stage 1.0 failed 1 times, most recent failure: Lost task 21.0 in stage 1.0 (TID 245, localhost): java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:152)
    at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:58)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
    at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:264)
    at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:126)
    at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
    at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
    at org.hammerlab.guacamole.readsets.rdd.CoverageRDD$$anonfun$coverage$1.apply(CoverageRDD.scala:54)
    at org.hammerlab.guacamole.readsets.rdd.CoverageRDD$$anonfun$coverage$1.apply(CoverageRDD.scala:54)
    at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:189)
    at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:91)
    at org.hammerlab.guacamole.readsets.rdd.CoverageRDD.coverage(CoverageRDD.scala:32)
    at org.hammerlab.guacamole.readsets.rdd.CoverageRDD.makeCappedLociSets(CoverageRDD.scala:67)
    at org.hammerlab.guacamole.loci.partitioning.CappedRegionsPartitioner.partition(CappedRegionsPartitioner.scala:109)
    at org.hammerlab.guacamole.loci.partitioning.LociPartitioning$.apply(LociPartitioning.scala:120)
    at org.hammerlab.guacamole.readsets.rdd.PartitionedRegions$.apply(PartitionedRegions.scala:135)
    at org.hammerlab.guacamole.commands.SomaticJoint$.makeCalls(SomaticJointCaller.scala:166)
    at org.hammerlab.guacamole.commands.SomaticJoint$Caller$.run(SomaticJointCaller.scala:91)
    at org.hammerlab.guacamole.commands.SomaticJoint$Caller$.run(SomaticJointCaller.scala:56)
    at org.hammerlab.guacamole.commands.SparkCommand.run(SparkCommand.scala:12)
    at org.hammerlab.guacamole.commands.Command.run(Command.scala:27)
    at org.hammerlab.guacamole.Main$.main(Main.scala:49)
    at org.hammerlab.guacamole.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:201)
    at org.apache.spark.serializer.DeserializationStream$$anon$2.getNext(Serializer.scala:198)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:152)
    at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:58)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
16/09/24 22:27:42 WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@d5667a9,BlockManagerId(driver, localhost, 44173))] in 3 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
    at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
    at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:449)
    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:470)
    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
    at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:470)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    ... 14 more
16/09/24 22:27:42 ERROR Executor: Managed memory leak detected; size = 158792352 bytes, TID = 251
16/09/24 22:27:42 INFO Executor: Executor killed task 27.0 in stage 1.0 (TID 251)
16/09/24 22:27:42 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@d5667a9,BlockManagerId(driver, localhost, 44173))]
    at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:118)
    at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:449)
    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:470)
    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
    at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:470)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
    at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
    ... 13 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    ... 14 more
16/09/24 22:27:42 WARN TaskSetManager: Lost task 15.0 in stage 1.0 (TID 239, localhost): TaskKilled (killed intentionally)
16/09/24 22:27:42 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:44173 in memory (size: 1117.0 B, free: 10.0 GB)
16/09/24 22:27:42 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
16/09/24 22:27:42 WARN TaskSetManager: Lost task 43.0 in stage 1.0 (TID 267, localhost): TaskKilled (killed intentionally)
16/09/24 22:27:42 WARN TaskSetManager: Lost task 5.0 in stage 1.0 (TID 229, localhost): TaskKilled (killed intentionally)
16/09/24 22:27:42 WARN TaskSetManager: Lost task 27.0 in stage 1.0 (TID 251, localhost): TaskKilled (killed intentionally)
16/09/24 22:27:42 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(1,0,ResultTask,TaskKilled,org.apache.spark.scheduler.TaskInfo@29908010,null)
16/09/24 22:27:42 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(1,0,ResultTask,TaskKilled,org.apache.spark.scheduler.TaskInfo@1a4cc65a,null)
16/09/24 22:27:42 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerTaskEnd(1,0,ResultTask,TaskKilled,org.apache.spark.scheduler.TaskInfo@5b4e0d40,null)
16/09/24 22:27:42 WARN TaskSetManager: Lost task 48.0 in stage 1.0 (TID 272, localhost): org.apache.spark.TaskKilledException
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:204)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

16/09/24 22:27:42 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/09/24 22:27:42 WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@25b0c8e7,BlockManagerId(driver, localhost, 44173))] in 1 attempts
org.apache.spark.SparkException: Could not find HeartbeatReceiver or it has been stopped.
    at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:161)
    at org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:126)
    at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:227)
    at org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:511)
    at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:100)
    at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:449)
    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:470)
    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
    at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
    at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:470)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
16/09/24 22:27:44 ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /home/bioinfo/zhipengcheng/file/tmp/blockmgr-f7ac149c-fd99-45a3-a917-08317e6d044c/3e/temp_local_cb74456c-7b8a-49c7-9084-562dee7449d6
java.io.FileNotFoundException: /home/bioinfo/zhipengcheng/file/tmp/blockmgr-f7ac149c-fd99-45a3-a917-08317e6d044c/3e/temp_local_cb74456c-7b8a-49c7-9084-562dee7449d6 (no such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:162)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:228)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:55)
    at org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:93)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.maybeSpill(ExternalAppendOnlyMap.scala:55)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:158)
    at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:58)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
16/09/24 22:27:44 ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /home/bioinfo/zhipengcheng/file/tmp/blockmgr-f7ac149c-fd99-45a3-a917-08317e6d044c/21/temp_local_92a22279-47a1-4fd4-a444-bb7b546255c5
java.io.FileNotFoundException: /home/bioinfo/zhipengcheng/file/tmp/blockmgr-f7ac149c-fd99-45a3-a917-08317e6d044c/21/temp_local_92a22279-47a1-4fd4-a444-bb7b546255c5 (no such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:162)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:228)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:55)
    at org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:93)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.maybeSpill(ExternalAppendOnlyMap.scala:55)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:158)
    at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:58)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
16/09/24 22:27:44 ERROR Executor: Managed memory leak detected; size = 158792352 bytes, TID = 260
16/09/24 22:27:44 ERROR Executor: Exception in task 36.0 in stage 1.0 (TID 260)
java.io.FileNotFoundException: /home/bioinfo/zhipengcheng/file/tmp/blockmgr-f7ac149c-fd99-45a3-a917-08317e6d044c/3e/temp_local_cb74456c-7b8a-49c7-9084-562dee7449d6 (no such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88)
    at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:181)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:206)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:55)
    at org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:93)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.maybeSpill(ExternalAppendOnlyMap.scala:55)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:158)
    at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:58)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
16/09/24 22:27:44 ERROR Executor: Managed memory leak detected; size = 158792352 bytes, TID = 248
16/09/24 22:27:44 ERROR Executor: Exception in task 24.0 in stage 1.0 (TID 248)
java.io.FileNotFoundException: /home/bioinfo/zhipengcheng/file/tmp/blockmgr-f7ac149c-fd99-45a3-a917-08317e6d044c/21/temp_local_92a22279-47a1-4fd4-a444-bb7b546255c5 (no such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88)
    at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:181)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:206)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:55)
    at org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:93)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.maybeSpill(ExternalAppendOnlyMap.scala:55)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:158)
    at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:58)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
16/09/24 22:27:44 INFO MemoryStore: MemoryStore cleared
16/09/24 22:27:44 INFO BlockManager: BlockManager stopped
16/09/24 22:27:44 INFO BlockManagerMaster: BlockManagerMaster stopped
16/09/24 22:27:44 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/09/24 22:27:44 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/09/24 22:27:44 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/09/24 22:27:44 INFO SparkContext: Successfully stopped SparkContext
16/09/24 22:27:44 INFO ShutdownHookManager: Shutdown hook called
16/09/24 22:27:44 INFO ShutdownHookManager: Deleting directory /home/bioinfo/zhipengcheng/file/tmp/spark-6d5fa119-4bc7-4c91-9fee-6c5d4257356b/httpd-89ef16d5-d6e2-482f-bdae-14628c7e7d41
16/09/24 22:27:44 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
16/09/24 22:27:44 ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /home/bioinfo/zhipengcheng/file/tmp/blockmgr-f7ac149c-fd99-45a3-a917-08317e6d044c/2e/temp_local_2f25d1e2-6f7d-4e88-9344-605da960be39
java.io.FileNotFoundException: /home/bioinfo/zhipengcheng/file/tmp/blockmgr-f7ac149c-fd99-45a3-a917-08317e6d044c/2e/temp_local_2f25d1e2-6f7d-4e88-9344-605da960be39 (no such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:162)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:228)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:55)
    at org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:93)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.maybeSpill(ExternalAppendOnlyMap.scala:55)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:158)
    at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:58)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
16/09/24 22:27:44 ERROR Utils: Uncaught exception in thread Executor task launch worker-58
java.lang.NullPointerException
    at org.apache.spark.scheduler.Task$$anonfun$run$1.apply$mcV$sp(Task.scala:95)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1229)
    at org.apache.spark.scheduler.Task.run(Task.scala:93)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
16/09/24 22:27:44 ERROR Executor: Managed memory leak detected; size = 158792352 bytes, TID = 265
16/09/24 22:27:44 ERROR Executor: Exception in task 41.0 in stage 1.0 (TID 265)
java.io.FileNotFoundException: /home/bioinfo/zhipengcheng/file/tmp/blockmgr-f7ac149c-fd99-45a3-a917-08317e6d044c/2e/temp_local_2f25d1e2-6f7d-4e88-9344-605da960be39 (no such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:88)
    at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:181)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:206)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:55)
    at org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:93)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.maybeSpill(ExternalAppendOnlyMap.scala:55)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:158)
    at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:58)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Exception in thread "Executor task launch worker-58" java.lang.IllegalStateException: RpcEnv already stopped.
    at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:159)
    at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131)
    at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
    at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
    at org.apache.spark.scheduler.local.LocalBackend.statusUpdate(LocalBackend.scala:151)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
16/09/24 22:27:44 INFO ShutdownHookManager: Deleting directory /home/bioinfo/zhipengcheng/file/tmp/spark-6d5fa119-4bc7-4c91-9fee-6c5d4257356b

real    40m59.023s
user    808m4.113s
sys 3m47.651s

So I have rechecked the server: it has 48 logical CPUs, 24 physical cores, and 122 GB of available memory. I also see from the Spark logs that spilling occurred throughout the whole run. So I think the optimization bottleneck is the combination of physical CPUs and memory; do you think so? (A tuning sketch follows below.)

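As a starting point, here is a minimal tuning sketch for a local-mode run on a machine of that size, assuming the job is launched through spark-submit; all heap sizes, core counts, and paths below are illustrative assumptions, not tested recommendations:

    # Hypothetical local-mode launch (values are assumptions, adjust to taste):
    # - a larger driver heap, which in local mode also holds all executor memory,
    #   directly targets the "GC overhead limit exceeded" errors above;
    # - capping parallelism below the logical-core count leaves memory per task;
    # - a longer heartbeat timeout gives a GC-bound executor more time to answer
    #   the driver before the RpcTimeoutExceptions in the log are raised;
    # - spark.local.dir should point at a directory with plenty of free space,
    #   since the log shows every task spilling to disk.
    spark-submit \
      --master 'local[24]' \
      --driver-memory 64g \
      --conf spark.executor.heartbeatInterval=60s \
      --conf spark.local.dir=/home/bioinfo/zhipengcheng/file/tmp \
      --class org.hammerlab.guacamole.Main \
      target/guacamole-0.0.1-SNAPSHOT.jar \
      somatic-joint ...

This is only a sketch of where the memory-related knobs live, not a verified fix; the right heap size and core count depend on how much of the 122 GB is actually free during the run.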
As for the crashing app, the server information is: zero physical CPUs (it is a virtual machine, CentOS 6.5, built on a Windows host), 4 logical CPUs, and 10 GB of available memory (the 48-core, 100 GB-memory server above is, of course, a real physical machine). After I rebooted the server, the command ran successfully with the Spark config driver.memory=4g, local[1]: a 400 MB BAM file produced a 14.8 MB VCF file. But it still threw an exception with a 7 GB BAM file (part of the console output follows):

16/09/24 18:14:02 INFO Executor: Finished task 219.0 in stage 4.0 (TID 891). 336305 bytes result sent to driver
16/09/24 18:14:02 INFO TaskSetManager: Starting task 220.0 in stage 4.0 (TID 892, localhost, partition 220,NODE_LOCAL, 2042 bytes)
16/09/24 18:14:02 INFO Executor: Running task 220.0 in stage 4.0 (TID 892)
16/09/24 18:14:02 INFO ShuffleBlockFetcherIterator: Getting 224 non-empty blocks out of 224 blocks
16/09/24 18:14:02 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/09/24 18:14:02 INFO TaskSetManager: Finished task 219.0 in stage 4.0 (TID 891) in 14795 ms on localhost (220/224)
16/09/24 18:14:15 INFO Executor: Finished task 220.0 in stage 4.0 (TID 892). 370486 bytes result sent to driver
16/09/24 18:14:15 INFO TaskSetManager: Starting task 221.0 in stage 4.0 (TID 893, localhost, partition 221,NODE_LOCAL, 2042 bytes)
16/09/24 18:14:15 INFO Executor: Running task 221.0 in stage 4.0 (TID 893)
16/09/24 18:14:15 INFO ShuffleBlockFetcherIterator: Getting 224 non-empty blocks out of 224 blocks
16/09/24 18:14:15 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/09/24 18:14:15 INFO TaskSetManager: Finished task 220.0 in stage 4.0 (TID 892) in 13312 ms on localhost (221/224)
16/09/24 18:14:30 INFO Executor: Finished task 221.0 in stage 4.0 (TID 893). 316930 bytes result sent to driver
16/09/24 18:14:30 INFO TaskSetManager: Starting task 222.0 in stage 4.0 (TID 894, localhost, partition 222,NODE_LOCAL, 2042 bytes)
16/09/24 18:14:30 INFO Executor: Running task 222.0 in stage 4.0 (TID 894)
16/09/24 18:14:30 INFO ShuffleBlockFetcherIterator: Getting 224 non-empty blocks out of 224 blocks
16/09/24 18:14:30 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/09/24 18:14:30 INFO TaskSetManager: Finished task 221.0 in stage 4.0 (TID 893) in 14452 ms on localhost (222/224)
16/09/24 18:14:44 INFO Executor: Finished task 222.0 in stage 4.0 (TID 894). 346590 bytes result sent to driver
16/09/24 18:14:44 INFO TaskSetManager: Starting task 223.0 in stage 4.0 (TID 895, localhost, partition 223,NODE_LOCAL, 2042 bytes)
16/09/24 18:14:44 INFO Executor: Running task 223.0 in stage 4.0 (TID 895)
16/09/24 18:14:44 INFO ShuffleBlockFetcherIterator: Getting 224 non-empty blocks out of 224 blocks
16/09/24 18:14:44 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/09/24 18:14:44 INFO TaskSetManager: Finished task 222.0 in stage 4.0 (TID 894) in 14175 ms on localhost (223/224)
16/09/24 18:14:54 INFO Executor: Finished task 223.0 in stage 4.0 (TID 895). 280835 bytes result sent to driver
16/09/24 18:14:54 INFO TaskSetManager: Finished task 223.0 in stage 4.0 (TID 895) in 10179 ms on localhost (224/224)
16/09/24 18:14:54 INFO DAGScheduler: ResultStage 4 (collect at CappedRegionsPartitioner.scala:114) finished in 3214.235 s
16/09/24 18:14:54 INFO DAGScheduler: Job 1 finished: collect at CappedRegionsPartitioner.scala:114, took 6420.668128 s
16/09/24 18:14:54 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool 
16/09/24 18:15:00 INFO BlockManagerInfo: Removed broadcast_99_piece0 on localhost:59758 in memory (size: 3.6 KB, free: 2.3 GB)
16/09/24 18:15:00 INFO ContextCleaner: Cleaned accumulator 4
16/09/24 18:15:00 INFO ContextCleaner: Cleaned accumulator 3
16/09/24 18:15:09 INFO ContextCleaner: Cleaned shuffle 1
16/09/24 18:15:09 INFO ContextCleaner: Cleaned shuffle 0
16/09/24 18:15:09 INFO BlockManagerInfo: Removed broadcast_95_piece0 on localhost:59758 in memory (size: 1122.0 B, free: 2.3 GB)
--> [12107.89 sec. later]: Partitioned loci: 1049 partitions.
Partition-size stats:
num:    1049,   mean:   789966.1,   stddev: 405879.6,   mad:    272653
elems:  817449, 290329, 1280665, 1821549, 911489, 1014036, 48160, 1231108, 1023372, 1052595, …, 649050, 505449, 1048721, 351333, 639272, 313225, 1052318, 550549, 531763, 1606956
sorted: 4734, 5462, 5728, 6449, 11514, 13501, 14946, 16547, 17107, 19255, …, 1922236, 1924294, 1925612, 2028128, 2035102, 2052936, 2075179, 2153283, 2407485, 2536919
0.1:    5715.2
1:  26171.2
5:  156923.2
10: 313447.8
25: 492129
50: 757261
75: 1035206
90: 1332269.2
95: 1513677.4
99: 1890262.8
99.9:   2395283.3

Contigs-spanned-per-partition stats:
num:    1049,   mean:   1.1,    stddev: 1.1,    mad:    0
elems:  1×24, 3, 1×28, 2, 1×44, 2, 1×53, 2, 1×7, 2, …, 1×54, 4, 1×100, 2, 1×145, 7, 1×3, 3, 2, 1×2
sorted: 1×1019, 2×19, 3×5, 4, 5, 6, 7, 16, 31
0.1:    1
1:  1
5:  1
10: 1
25: 1
50: 1
75: 1
90: 1
95: 1
99: 2.5
99.9:   15.6
16/09/24 18:15:43 WARN BlockManager: Putting block broadcast_100 failed
16/09/24 18:15:43 INFO SparkUI: Stopped Spark web UI at http://192.168.2.85:4040
16/09/24 18:15:43 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/09/24 18:15:43 INFO MemoryStore: MemoryStore cleared
16/09/24 18:15:43 INFO BlockManager: BlockManager stopped
16/09/24 18:15:43 INFO BlockManagerMaster: BlockManagerMaster stopped
16/09/24 18:15:43 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/09/24 18:15:43 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/09/24 18:15:43 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.IdentityHashMap.resize(IdentityHashMap.java:471)
    at java.util.IdentityHashMap.put(IdentityHashMap.java:440)
    at org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:176)
    at org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:224)
    at org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:223)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:223)
    at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:203)
    at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:70)
    at org.apache.spark.util.collection.SizeTracker$class.takeSample(SizeTracker.scala:78)
    at org.apache.spark.util.collection.SizeTracker$class.afterUpdate(SizeTracker.scala:70)
    at org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:31)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
    at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:169)
    at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:147)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:798)
    at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:645)
    at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1003)
    at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
    at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85)
    at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
    at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
    at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1326)
    at org.hammerlab.guacamole.readsets.rdd.PartitionedRegions$.compute(PartitionedRegions.scala:199)
    at org.hammerlab.guacamole.readsets.rdd.PartitionedRegions$.apply(PartitionedRegions.scala:184)
    at org.hammerlab.guacamole.readsets.rdd.PartitionedRegions$.apply(PartitionedRegions.scala:146)
    at org.hammerlab.guacamole.commands.SomaticJoint$.makeCalls(SomaticJointCaller.scala:166)
    at org.hammerlab.guacamole.commands.SomaticJoint$Caller$.run(SomaticJointCaller.scala:91)
    at org.hammerlab.guacamole.commands.SomaticJoint$Caller$.run(SomaticJointCaller.scala:56)
    at org.hammerlab.guacamole.commands.SparkCommand.run(SparkCommand.scala:12)
    at org.hammerlab.guacamole.commands.Command.run(Command.scala:27)
    at org.hammerlab.guacamole.Main$.main(Main.scala:49)
16/09/24 18:15:43 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/09/24 18:15:43 INFO ShutdownHookManager: Shutdown hook called
16/09/24 18:15:43 INFO ShutdownHookManager: Deleting directory /mnt/hgfs/C/spark05-tmp/spark-c9cdd80e-6488-4335-96c3-d86a10226263
16/09/24 18:15:43 INFO ShutdownHookManager: Deleting directory /mnt/hgfs/C/spark05-tmp/spark-c9cdd80e-6488-4335-96c3-d86a10226263/httpd-33688602-2154-4815-ad63-dc66b9ba44ca

real    204m30.149s
user    258m56.571s
sys 21m56.325s
ryan-williams commented 7 years ago

That's interesting, thanks for all of the info.

Here are a few observations and recommendations:

Spills

As a rule, anytime you see "spill" messages, you need more memory. Usually you'd want more memory per task, and a common remedy would be to increase the number of Spark partitions in the spilling stages; in this case, though, the amount of memory you're running with is below the levels where a lot of Spark infrastructure makes sense, so the main recommendation is "use more memory!", which I will say in a few different ways below 😄.
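
For concreteness, here's a sketch of the partition-count remedy; the BAM/FASTA paths and the value are illustrative, and --max-reads-per-partition is the same flag used in the benchmark command further down (smaller values mean more, smaller partitions, hence less memory per task):

./scripts/guacamole somatic-standard \
  --normal-reads normal.bam --tumor-reads tumor.bam \
  --reference hg19.fasta \
  --max-reads-per-partition 250000 \
  --out out.vcf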

More Executors/Cores Require More Memory

The more cores you run with, the more memory you will need! There's a fixed memory cost to each logical "executor"/core that you use in your app.

For example, Guacamole moves the reference around by "broadcast"ing it (contig by contig, as needed) to each executor, which could easily mean that the first several GB of each executor are used just for storing that! So, if you go from 24 executors to 48 on 16GB of total memory, you have 24 new "executors" that need some or all of the reference, so it's not too surprising that the app blows up when making that jump!

I have done some work toward doing more efficient things with the reference that should mitigate this a bit, but for the time being that's how it works!

Noting this cores-vs.-memory-required relationship in your data above: those are some pretty good data points about this exact tradeoff, so thanks for providing that info!

Use More Memory

If you have 122GB available, I would use a fair amount more of it in order to avoid spills and OOMs!

For some perspective, a typical (small-ish) Guacamole run I would do might operate on a couple of 20GB BAMs (spread across perhaps ~500 128MB HDFS partitions), accordingly using 500-2000 cores spread across 125-500 4-core, 12GB executors, i.e. ~3GB/core and ~1.5-6TB total; I'd expect this to finish in 10-20mins with no spilling!

The reason I might use up to 4x as many cores (2000 above) as the number of initial HDFS partitions (and hence the number of partitions in the initial RDDs) is that some of the intermediate representations of the reads take up more space in memory than the compressed reads in BAM format on disk that we start with; Guacamole has some heuristics for letting the number of partitions in those intermediate stages expand elastically based on the number of reads it encounters.

So, I wouldn't really expect to get very far on a maximum-10GB machine/VM, though characterizing the exact kinds of failures you encounter there, and the (relatively trivial) sizes of BAMs you can run on in such an environment, is fairly interesting to me as a proxy/impetus for optimizing Guacamole as a whole.

On your 122GB machine, using my heuristics above (~3GB/core), I'd think you could run with 40 cores without any memory problems if you used all 122GB.

Of course, empirically you've been able to get away with even <1GB/core (16GB/24 cores, though that was probably spilling), so you could explore the space a little bit, and it'd be interesting to hear what you find; for example, if I were you I'd try a 96GB/48-core run, which I'd expect to have a reasonable chance of being performant and spill-free.

Judging by your 16GB/24-core success, with 48 cores you could work your way all the way down to 48GB or fewer, and see where you start spilling / degrading performance.
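
As a concrete sketch of that sweep (the conf filenames, BAM/FASTA paths, and output names here are made up; the properties-file format matches what Guacamole echoes at startup):

for mem in 96g 72g 48g; do
  printf 'spark.driver.memory %s\nspark.master local[48]\n' "$mem" > conf/driver-$mem
  GUAC_SPARK_CONFS=conf/driver-$mem ./scripts/guacamole somatic-joint \
    normal.bam tumor.bam --reference-fasta hg19.fasta --out out-$mem.vcf
done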

Thanks for the reports from the field and let me know if any of the above doesn't make sense or you get more data on what works and doesn't!

car2008 commented 7 years ago

Hi @ryan-williams, the runtime is real 202m55.052s using "driver.memory 20g, local[4]", and that's the fastest so far; the console printed:

Using Spark properties files: conf/local-multi,conf/driver-20g
spark.driver.memory 20g
spark.master local[4]
--> [Sun Sep 25 08:41:57 CST 2016]: Guacamole starting.
.
.
.
--> [11.09 sec. later]: Called 431,499 germline and 0 somatic variants.
--> [0.01 sec. later]: Writing 431,499 calls across 1 samples  to /home/bioinfo/zhipengcheng/file/result/out-local4-20g.vcf
--> [27.46 sec. later]: Done.
16/09/25 02:44:20 INFO SparkUI: Stopped Spark web UI at http://192.168.2.25:4040
16/09/25 02:44:20 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/09/25 02:44:22 INFO MemoryStore: MemoryStore cleared
16/09/25 02:44:22 INFO BlockManager: BlockManager stopped
16/09/25 02:44:22 INFO BlockManagerMaster: BlockManagerMaster stopped
16/09/25 02:44:22 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/09/25 02:44:22 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/09/25 02:44:22 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/09/25 02:44:22 INFO SparkContext: Successfully stopped SparkContext
16/09/25 02:44:22 INFO ShutdownHookManager: Shutdown hook called
16/09/25 02:44:22 INFO ShutdownHookManager: Deleting directory /home/bioinfo/zhipengcheng/file/tmp/spark-5dd6ca6a-c16e-4c01-bf18-779582b9c89c
16/09/25 02:44:22 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
16/09/25 02:44:22 INFO ShutdownHookManager: Deleting directory /home/bioinfo/zhipengcheng/file/tmp/spark-5dd6ca6a-c16e-4c01-bf18-779582b9c89c/httpd-41e6905b-ba6e-4e8e-ad23-75d83d676b73

real    202m55.052s
user    1187m3.005s
sys 14m6.541s
car2008 commented 7 years ago

Hi @ryan-williams, I have tested different configs for running Guacamole. The results are: the runtime is real 189m9.445s using "driver.memory30g, lcoal[4]", real 168m44.524s using "driver.memory120g, lcoal[20]", and real 175m28.584s using "driver.memory120g, lcoal[24]", respectively. But it threw an exception using "driver.memory120g, lcoal[30]":

2016-09-25 18:01:01 ERROR LiveListenerBus:74 - SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, localhost, 56338),broadcast_58_piece0,StorageLevel(false, true, false, false, 1),76356,0,0))
2016-09-25 18:01:01 ERROR LiveListenerBus:74 - SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, localhost, 56338),rdd_17_493,StorageLevel(false, true, false, true, 1),1437496,0,0))
2016-09-25 18:01:01 ERROR TaskSchedulerImpl:95 - Exception in statusUpdate
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.scheduler.TaskResultGetter$$anon$2@323b3b77 rejected from java.util.concurrent.ThreadPoolExecutor@4b464968[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1615]
    at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
    at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:49)
    at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:354)
    at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:331)
    at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalBackend.scala:66)
    at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
    at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2016-09-25 18:01:04 ERROR LiveListenerBus:74 - SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, localhost, 56338),rdd_17_497,StorageLevel(false, true, false, true, 1),1236888,0,0))
2016-09-25 18:01:04 ERROR TaskSchedulerImpl:95 - Exception in statusUpdate
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.scheduler.TaskResultGetter$$anon$2@24e7ff19 rejected from java.util.concurrent.ThreadPoolExecutor@4b464968[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1615]
    at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
    at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:49)
    at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:354)
    at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:331)
    at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalBackend.scala:66)
    at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
    at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2016-09-25 18:01:05 ERROR LiveListenerBus:74 - SparkListenerBus has already stopped! Dropping event SparkListenerBlockUpdated(BlockUpdatedInfo(BlockManagerId(driver, localhost, 56338),rdd_17_498,StorageLevel(false, true, false, true, 1),1587416,0,0))
2016-09-25 18:01:05 ERROR TaskSchedulerImpl:95 - Exception in statusUpdate
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.scheduler.TaskResultGetter$$anon$2@4d70aecb rejected from java.util.concurrent.ThreadPoolExecutor@4b464968[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1615]
    at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
    at org.apache.spark.scheduler.TaskResultGetter.enqueueSuccessfulTask(TaskResultGetter.scala:49)
    at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:354)
    at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:331)
    at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalBackend.scala:66)
    at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
    at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 509 in stage 6.0 failed 1 times, most recent failure: Lost task 509.0 in stage 6.0 (TID 1629, localhost): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 182969 ms
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
    at org.apache.spark.rdd.RDD.count(RDD.scala:1157)
    at org.hammerlab.guacamole.commands.SomaticJoint$Caller$.run(SomaticJointCaller.scala:106)
    at org.hammerlab.guacamole.commands.SomaticJoint$Caller$.run(SomaticJointCaller.scala:56)
    at org.hammerlab.guacamole.commands.SparkCommand.run(SparkCommand.scala:12)
    at org.hammerlab.guacamole.commands.Command.run(Command.scala:27)
    at org.hammerlab.guacamole.Main$.main(Main.scala:49)
    at org.hammerlab.guacamole.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

real    121m35.882s
user    2222m53.298s
sys 66m30.893s
ryan-williams commented 7 years ago

Your last stack trace, starting with SparkListenerBus has already stopped!, almost certainly has causative exceptions higher in the stdout.

Please put the entire stdout of these runs, and/or event-log JSON files, somewhere (e.g. gist.github.com, not here) where I can see them if you want me to help you debug them.

I'm assuming your lcoal[…] above were actually run as local[…].

It's interesting to note that you seem to have gotten a 2x speedup with 4 cores solely by using more memory, presumably due to eliminating spilling (and reducing GC?). That's consistent with what I'd expect.

car2008 commented 7 years ago

Hi @ryan-williams, the results I have tested are listed below:

Firstly, in the past I always thought a physical CPU would run more efficiently than a logical CPU, setting aside the difference in core counts; but now it seems that logical CPUs can run just as well as physical ones. Secondly, for the same number of local threads, more memory doesn't necessarily mean running faster; there seems to be a threshold (about 2-4GB/core) between memory and cores, and the same conclusion holds when memory is held fixed and the cores vary. Thirdly, more cores (48) need more memory; without it, the run doesn't even work normally. Lastly, the shortest processing time so far is 125m9.310s, using "driver.memory 96g, local[48]".

Sorry for some more questions: I don't know why I can't open gist.github.com in my browser. In which directory can I find the stdout file in local Spark mode? Is "10~20min" the shortest processing time when you run in yarn-cluster mode and/or local mode with 6TB memory and 2000 physical cores? What's the difference between driver-memory and executor-memory, and why do we only set driver-memory in Spark local mode? I'm a little confused about them.

ryan-williams commented 7 years ago

Thanks for those results, sorry for the delay in responding.

it seems that logical CPUs can run just as well as physical ones.

Cool! I don't have much insight into that so that's good to know.

I can't open gist.github.com in my browser

That's pretty strange; it is publicly accessible on the internet afaik… what happens when you go to that URL?

In which directory can I find the stdout file in local Spark mode?

I just mean all the stdout you see when running a Spark app, starting from the bash command that kicks off the run and ending with the statistics output by time at the bottom. Your previous stack traces were only showing pieces of that, and, confusingly, Spark tends to output many exceptions while it is crashing / shutting down, so you sometimes have to look around for a bit to work out which ones are signal vs. noise.
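
One easy way to capture all of it in a shareable file (the log filename is illustrative):

GUAC_SPARK_CONFS=conf/driver-20g ./scripts/guacamole somatic-joint \
  normal.bam tumor.bam --reference-fasta hg19.fasta --out out.vcf 2>&1 | tee full-run.log

The 2>&1 folds stderr into stdout so the Spark log lines and any crash stack traces all end up in the same file.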

Is "10~20min" the shortest processing time when you run in yarn-cluster mode or/and local mode using 6T memory and 2000 physical cores?

I'm running some benchmarks now with far fewer resources than that, so that we can compare things more directly here (and in general I'm working on bringing all of these numbers down!). I have two BAMs I've been running the somatic-standard caller against on my cluster, with sizes 19GB and 21GB (so 40GB / 200MM reads total).

I ran on them with a 10GB driver and varying numbers of 12GB, 4-core executors, to work my way down to reasonable sizes and timings that you should be able to emulate:

The command I used for all of them was of the form:

GUAC_SPARK_CONFS=… scripts/guacamole somatic-standard --normal-reads $n --tumor-reads $t --reference $hg19 --max-reads-per-partition 500000 --out $out.vcf

and here are the Spark configs from the confs files I used (output at the beginning of the Guacamole run).
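
The exact values varied per run, but for shape, each conf file follows the same properties format echoed at startup; these values are illustrative only, not the exact ones from these runs:

spark.master yarn-client
spark.driver.memory 10g
spark.executor.instances 125
spark.executor.cores 4
spark.executor.memory 12g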

So I think this paints a pretty clear picture that the total amount of resources you have available (48 cores, 128GB), when spread around in cluster mode, is enough to get good performance; the above was also all run on ~5x as much input-BAM data (40GB) as the one 7GB BAM you mentioned you were running on, IIRC?

So we'll just have to figure out what might be causing your local runs to degrade relative to the cluster runs above. In theory, the cluster runs should be the ones with higher overhead and fixed costs, so I'm confident we can improve your timings with a little more digging.

One major thing I realized you should probably try is switching over to the somatic-standard caller; the somatic-joint caller does some inefficient stuff at the end where it collects a lot of variants (and a lot of metadata about each variant) to the driver; we never really finished implementing that part in a sane way, and I can see in some of your excerpts above that you are collecting many millions of variants to the driver. That could easily be a big cause of the performance problems you're seeing; again, full logs of your runs will be necessary to diagnose that.
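
(A side note, and an assumption on my part: if you do stick with somatic-joint in the meantime, large collects to the driver are also capped by Spark's spark.driver.maxResultSize, which defaults to 1g in Spark 1.6, so raising it alongside driver memory may be necessary, e.g. adding to one of the conf files passed via GUAC_SPARK_CONFS:

spark.driver.maxResultSize 4g

The 4g value is illustrative.)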

What's the difference between driver-memory and executor-memory, and why do we only set driver-memory in Spark local mode?

I'm not super familiar with the internals of spark local-mode, but I think the concept of an executor isn't really meaningful in local mode; this SO corroborates that. I think you basically have one memory-knob to turn, and that's spark.driver.memory aka --driver-memory.
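
Two equivalent ways to set that knob (the values, filename, and <app-jar-and-args> placeholder are illustrative):

echo 'spark.driver.memory 64g' > conf/driver-64g          # properties-file form, picked up via GUAC_SPARK_CONFS
spark-submit --driver-memory 64g --master 'local[24]' <app-jar-and-args>   # flag form, if invoking spark-submit directly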

I hope that helps! I'm confident we can still bring your runtimes down a lot, so let me know how this all sits with you.

Also, are you able/interested to share your example BAMs with me privately? I don't think it will be necessary, but I'm just curious; we're always interested in collecting data for internal benchmarking.

car2008 commented 7 years ago

Hi @ryan-williams, thank you very much for your advice, and very sorry for the long delay in responding. I have tried the command:

./scripts/guacamole somatic-standard --normal-reads /home/bioinfo/zhipengcheng/file/file2/dedup.bam --tumor-reads /home/bioinfo/zhipengcheng/file/file2/10298.dedup.bam --reference /home/bioinfo/zhipengcheng/file/ucsc.hg19.fasta --out /home/bioinfo/zhipengcheng/file/result/out-standard-24-96g-1.vcf

The two BAM files I'm using are 5GB and 5.1GB, respectively.

The runtime is real 121m38.934s using 24 cores, 96GB memory in spark-local mode. The full log here.
The runtime is real 120m21.917s using 24 cores, 72GB memory in spark-local mode. The full log here.
The runtime is real 157m16.150s using 24 cores, 48GB memory in spark-local mode. The full log here.

So the runtime didn't decrease noticeably using the somatic-standard caller in spark-local mode. It threw an exception when using 48 cores, 96GB memory in spark-local mode; the full log is here.

Another question: can I process only one BAM file, as with ./scripts/guacamole somatic-joint /home/bioinfo/zhipengcheng/file/dedup.bam --reference /home/bioinfo/zhipengcheng/file/ucsc.hg19.fasta --out /home/bioinfo/zhipengcheng/file/result/out.vcf, when using the somatic-standard caller? I have tried:

./scripts/guacamole somatic-standard /home/bioinfo/zhipengcheng/file/file2/10298.dedup.bam --reference /home/bioinfo/zhipengcheng/file/ucsc.hg19.fasta --out /home/bioinfo/zhipengcheng/file/result/out-standard-32-128g.vcf

but it failed, throwing the exception No argument is allowed: /home/bioinfo/zhipengcheng/file/file2/10298.dedup.bam