hammerlab / guacamole

Spark-based variant calling, with experimental support for multi-sample somatic calling (including RNA) and local assembly
Apache License 2.0

Whenever I pass a BAM file as input, I get nothing out of it #375

Open ankushreddy opened 8 years ago

ankushreddy commented 8 years ago

Hi team,

When I submit the job, it executes successfully, but it does not call any genotypes.

Could you please help me with this issue?

       spark-submit --master yarn --deploy-mode client --driver-java-options -Dlog4j.configuration=/local/guacamole/scripts/logs4j.properties --executor-memory 4g --driver-memory 10g --num-executors 20 --executor-cores 10 --class org.hammerlab.guacamole.Guacamole --verbose /local/guacamole/target/guacamole-with-dependencies-0.0.1-SNAPSHOT.jar germline-threshold --reads hdfs:///shared/avocado_test/NA06984.454.MOSAIK.SRP000033.2009_11.bam --out hdfs:///user/asugured/guacamole/result.vcf

Please see the output of the spark-submit command I used.

16/01/27 16:14:02 INFO YarnScheduler: Adding task set 19.0 with 1 tasks
16/01/27 16:14:02 INFO TaskSetManager: Starting task 0.0 in stage 19.0 (TID 14, istb1-l2-b12-07.hadoop.priv, PROCESS_LOCAL, 1432 bytes)
16/01/27 16:14:02 INFO BlockManagerInfo: Added broadcast_15_piece0 in memory on istb1-l2-b12-07.hadoop.priv:38654 (size: 1808.0 B, free: 2.1 GB)
16/01/27 16:14:02 INFO DAGScheduler: Stage 19 (count at VariationRDDFunctions.scala:144) finished in 0.101 s
16/01/27 16:14:02 INFO TaskSetManager: Finished task 0.0 in stage 19.0 (TID 14) in 88 ms on istb1-l2-b12-07.hadoop.priv (1/1)
16/01/27 16:14:02 INFO YarnScheduler: Removed TaskSet 19.0, whose tasks have all completed, from pool
16/01/27 16:14:02 INFO DAGScheduler: Job 5 finished: count at VariationRDDFunctions.scala:144, took 0.115971 s
16/01/27 16:14:02 INFO VariantContextRDDFunctions: Write 0 records
16/01/27 16:14:02 INFO MapPartitionsRDD: Removing RDD 22 from persistence list
16/01/27 16:14:02 INFO BlockManager: Removing RDD 22
* Delayed Messages *
Called 0 genotypes.
Region counts: filtered 0 total regions to 0 relevant regions, expanded for overlaps by NaN% to 0
Regions per task: min=NaN 25%=NaN median=NaN (mean=NaN) 75%=NaN max=NaN. Max is NaN% more than mean.
16/01/27 16:14:03 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
16/01/27 16:14:03 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
16/01/27 16:14:03 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
16/01/27 16:14:03 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
16/01/27 16:14:03 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/01/27 16:14:03 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
16/01/27 16:14:03 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}

Thanks & Regards, Ankush Reddy

arahuja commented 8 years ago

Hi @ankushreddy - thanks for checking out Guacamole. Most of the callers are still in progress, but hopefully we can help you test them out. For germline-threshold, there is also a --threshold parameter to be aware of, which is the lowest VAF (variant allele fraction) necessary to call a variant. It's unlikely you are hitting this, but I just wanted to make you aware.
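
For illustration, a hedged sketch of what passing that flag could look like, appended to the spark-submit invocation above; the value and its units (percent vs. fraction) are assumptions on my part, so please check the command's help output for the exact semantics:

       # Hypothetical: only call sites whose variant allele fraction clears the threshold
       spark-submit ... --class org.hammerlab.guacamole.Guacamole \
         /local/guacamole/target/guacamole-with-dependencies-0.0.1-SNAPSHOT.jar \
         germline-threshold \
         --reads hdfs:///shared/avocado_test/NA06984.454.MOSAIK.SRP000033.2009_11.bam \
         --threshold 10 \
         --out hdfs:///user/asugured/guacamole/result.vcf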

Does your BAM file have MD tags? If not: germline-threshold currently requires that the reads already have MD tags and, unfortunately, drops reads that don't. You can add MD tags with samtools or ADAM.
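
A minimal sketch of the samtools route, assuming a coordinate-sorted BAM and the matching reference FASTA (file names here are placeholders):

       # calmd recomputes MD (and NM) tags against the given reference; -b writes BAM output
       samtools calmd -b reads.sorted.bam reference.fa > reads.md.bam
       # index the result so downstream tools can seek into it
       samtools index reads.md.bam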

We do support computing MD tags in Guacamole as well, but looking over the code, this isn't configured correctly right now. I will file an issue for that and fix it.

ankushreddy commented 8 years ago

Hi @arahuja, thanks for the quick reply. I have a few more questions; I am actually new to genomics, so I don't know exactly what is going on in the code.

I used adam-submit and added the tags with this command:

       ./adam-submit transform /shared/avocado_test/NA06984.454.ssaha.SRP000033.2009_10.bam /shared/avocado_out/NA06984.454.ssaha.SRP000033.2009_10.bam_tags.adam -add_md_tags /shared/avocado_test/human_b36_male.fa

Then I got three outputs:

drwxr-xr-x - asugured hdfs    0 2016-01-28 12:37 /shared/avocado_out/NA06984.454.ssaha.SRP000033.2009_10.bam_tags.adam
-rw-r--r-- 3 asugured hdfs 1350 2016-01-28 12:32 /shared/avocado_out/NA06984.454.ssaha.SRP000033.2009_10.bam_tags.adam.rgdict
-rw-r--r-- 3 asugured hdfs 4513 2016-01-28 12:32 /shared/avocado_out/NA06984.454.ssaha.SRP000033.2009_10.bam_tags.adam.seqdict

Later I used /shared/avocado_out/NA06984.454.ssaha.SRP000033.2009_10.bam_tags.adam to submit it with spark-submit for Guacamole. Please find the submit command and error message below.

       spark-submit --master yarn --deploy-mode client --driver-java-options -Dlog4j.configuration=/local/guacamole/scripts/logs4j.properties --executor-memory 4g --driver-memory 10g --num-executors 20 --executor-cores 10 --class org.hammerlab.guacamole.Guacamole --verbose /local/guacamole/target/guacamole-with-dependencies-0.0.1-SNAPSHOT.jar germline-threshold --reads hdfs:///shared/avocado_out/NA06984.454.ssaha.SRP000033.2009_10.bam_tags.adam --out hdfs:///user/asugured/guacamole/result2.vcf

The error it throws is a Parquet/Avro schema mismatch:

16/01/28 12:46:07 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, istb1-l2-b11-01.hadoop.priv): org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch. Avro field 'recordGroupPredictedMedianInsertSize' not found. at org.apache.parquet.avro.AvroIndexedRecordConverter.getAvroField(AvroIndexedRecordConverter.java:128) at org.apache.parquet.avro.AvroIndexedRecordConverter.(AvroIndexedRecordConverter.java:89) at org.apache.parquet.avro.AvroIndexedRecordConverter.(AvroIndexedRecordConverter.java:64) at org.apache.parquet.avro.AvroCompatRecordMaterializer.(AvroCompatRecordMaterializer.java:34) at org.apache.parquet.avro.AvroReadSupport.newCompatMaterializer(AvroReadSupport.java:138) at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:130) at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:179) at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:201) at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:133) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)

16/01/28 12:46:07 INFO TaskSetManager: Starting task 1.1 in stage 0.0 (TID 2, istb1-l2-b11-01.hadoop.priv, NODE_LOCAL, 1516 bytes) 16/01/28 12:46:07 INFO TaskSetManager: Lost task 1.1 in stage 0.0 (TID 2) on executor istb1-l2-b11-01.hadoop.priv: org.apache.parquet.io.InvalidRecordException (Parquet/Avro schema mismatch. Avro field 'recordGroupPredictedMedianInsertSize' not found.) [duplicate 1] 16/01/28 12:46:07 INFO TaskSetManager: Starting task 1.2 in stage 0.0 (TID 3, istb1-l2-b13-05.hadoop.priv, NODE_LOCAL, 1516 bytes) 16/01/28 12:46:07 INFO TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) on executor istb1-l2-b12-09.hadoop.priv: org.apache.parquet.io.InvalidRecordException (Parquet/Avro schema mismatch. Avro field 'recordGroupPredictedMedianInsertSize' not found.) [duplicate 2] 16/01/28 12:46:07 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 4, istb1-l2-b12-09.hadoop.priv, NODE_LOCAL, 1516 bytes) 16/01/28 12:46:08 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 4) on executor istb1-l2-b12-09.hadoop.priv: org.apache.parquet.io.InvalidRecordException (Parquet/Avro schema mismatch. Avro field 'recordGroupPredictedMedianInsertSize' not found.) [duplicate 3] 16/01/28 12:46:08 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 5, istb1-l2-b12-09.hadoop.priv, NODE_LOCAL, 1516 bytes) 16/01/28 12:46:08 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 5) on executor istb1-l2-b12-09.hadoop.priv: org.apache.parquet.io.InvalidRecordException (Parquet/Avro schema mismatch. Avro field 'recordGroupPredictedMedianInsertSize' not found.) [duplicate 4] 16/01/28 12:46:08 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 6, istb1-l2-b12-09.hadoop.priv, NODE_LOCAL, 1516 bytes) 16/01/28 12:46:08 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 6) on executor istb1-l2-b12-09.hadoop.priv: org.apache.parquet.io.InvalidRecordException (Parquet/Avro schema mismatch. Avro field 'recordGroupPredictedMedianInsertSize' not found.) 
[duplicate 5] 16/01/28 12:46:08 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job 16/01/28 12:46:08 INFO YarnScheduler: Cancelling stage 0 16/01/28 12:46:08 INFO YarnScheduler: Stage 0 was cancelled 16/01/28 12:46:08 INFO DAGScheduler: Job 0 failed: reduce at ADAMRDDFunctions.scala:127, took 4.545263 s 16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null} 16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null} 16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null} 16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null} 16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null} 16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null} 16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null} 16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null} 16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null} 16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null} 16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null} 16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null} 16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null} 16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null} 16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null} 16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null} 16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null} 16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null} 16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null} 16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null} 16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null} 16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null} 16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null} 16/01/28 12:46:08 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null} 16/01/28 12:46:08 INFO SparkUI: Stopped Spark web UI at http://istb1-l2-b13-u35.hadoop.priv:4042 16/01/28 12:46:08 INFO DAGScheduler: Stopping DAGScheduler 16/01/28 12:46:08 INFO YarnClientSchedulerBackend: Shutting down all executors 16/01/28 12:46:08 INFO YarnClientSchedulerBackend: Asking each executor to shut down 16/01/28 12:46:08 INFO YarnClientSchedulerBackend: Stopped 16/01/28 12:46:08 INFO OutputCommitCoordinator$OutputCommitCoordinatorActor: OutputCommitCoordinator stopped! 16/01/28 12:46:08 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped! 
16/01/28 12:46:08 INFO MemoryStore: MemoryStore cleared 16/01/28 12:46:08 INFO BlockManager: BlockManager stopped 16/01/28 12:46:08 INFO BlockManagerMaster: BlockManagerMaster stopped 16/01/28 12:46:08 INFO SparkContext: Successfully stopped SparkContext Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, istb1-l2-b12-09.hadoop.priv): org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch. Avro field 'recordGroupPredictedMedianInsertSize' not found. at org.apache.parquet.avro.AvroIndexedRecordConverter.getAvroField(AvroIndexedRecordConverter.java:128) at org.apache.parquet.avro.AvroIndexedRecordConverter.(AvroIndexedRecordConverter.java:89) at org.apache.parquet.avro.AvroIndexedRecordConverter.(AvroIndexedRecordConverter.java:64) at org.apache.parquet.avro.AvroCompatRecordMaterializer.(AvroCompatRecordMaterializer.java:34) at org.apache.parquet.avro.AvroReadSupport.newCompatMaterializer(AvroReadSupport.java:138) at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:130) at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:179) at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:201) at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.(NewHadoopRDD.scala:133) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)

Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) Jan 28, 2016 12:46:03 PM INFO: org.apache.parquet.hadoop.ParquetInputFormat: Total input paths to process : 2 16/01/28 12:46:08 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 16/01/28 12:46:08 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 16/01/28 12:46:08 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.

Could you please help me understand this?

Thanks & Regards, Ankush Reddy.

arahuja commented 8 years ago

What version of ADAM are you using? If you have moved to a newer version of ADAM, the schema of the ADAM format may be different. Using the BAM output of ADAM may work better, but I have not tried that before.
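
If it helps, a hedged sketch of what that might look like; as far as I know ADAM's transform picks the output format from the file extension, but treat that as an assumption and check the ADAM docs. The paths simply reuse the ones from your earlier transform command:

       # Hypothetical: emit BAM (with MD tags added) instead of an .adam Parquet directory
       ./adam-submit transform \
         /shared/avocado_test/NA06984.454.ssaha.SRP000033.2009_10.bam \
         /shared/avocado_out/NA06984.454.ssaha.SRP000033.2009_10.md.bam \
         -add_md_tags /shared/avocado_test/human_b36_male.fa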

ankushreddy commented 8 years ago

I am on the latest version of ADAM; I just cloned it from git and started using it. When we use transform in adam-submit, does it produce ADAM format, or some kind of Parquet format?

Please correct me if I am not following the correct process.

ryan-williams commented 8 years ago

Hey @ankushreddy, the HEAD of ADAM has different schemas than Guacamole expects; we depend on ADAM 0.18.1, and they've been doing big refactorings recently.

If you can try those steps again using that version of ADAM, guacamole should be able to read the .adam files correctly.

Or, as @arahuja said, you can try using .bam as your intermediate format instead of .adam.
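
A rough sketch of pinning ADAM to that release from source; the exact tag name is an assumption, so list the repository's tags to find the one corresponding to 0.18.1:

       git clone https://github.com/bigdatagenomics/adam.git
       cd adam
       git tag -l | grep 0.18.1      # find the exact tag name for the 0.18.1 release
       git checkout <0.18.1-tag>     # placeholder; substitute the tag found above
       mvn -DskipTests package       # ADAM builds with Maven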

ankushreddy commented 8 years ago

@ryan-williams Hi Ryan, thanks for guiding me. I used ADAM 0.18.1.

I am getting a NullPointerException. Please find the log below.

16/01/28 20:05:38 INFO MemoryStore: ensureFreeSpace(303352) called with curMem=0, maxMem=5556991426 16/01/28 20:05:38 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 296.2 KB, free 5.2 GB) 16/01/28 20:05:38 INFO MemoryStore: ensureFreeSpace(27127) called with curMem=303352, maxMem=5556991426 16/01/28 20:05:38 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 26.5 KB, free 5.2 GB) 16/01/28 20:05:38 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.107.18.34:36737 (size: 26.5 KB, free: 5.2 GB) 16/01/28 20:05:38 INFO SparkContext: Created broadcast 0 from newAPIHadoopFile at ADAMContext.scala:158 16/01/28 20:05:39 INFO FileInputFormat: Total input paths to process : 2 16/01/28 20:05:39 INFO SparkContext: Starting job: reduce at ADAMRDDFunctions.scala:127 16/01/28 20:05:39 INFO DAGScheduler: Got job 0 (reduce at ADAMRDDFunctions.scala:127) with 2 output partitions (allowLocal=false) 16/01/28 20:05:39 INFO DAGScheduler: Final stage: ResultStage 0(reduce at ADAMRDDFunctions.scala:127) 16/01/28 20:05:39 INFO DAGScheduler: Parents of final stage: List() 16/01/28 20:05:39 INFO DAGScheduler: Missing parents: List() 16/01/28 20:05:39 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at mapPartitions at ADAMRDDFunctions.scala:126), which has no missing parents 16/01/28 20:05:39 INFO MemoryStore: ensureFreeSpace(3496) called with curMem=330479, maxMem=5556991426 16/01/28 20:05:39 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.4 KB, free 5.2 GB) 16/01/28 20:05:39 INFO MemoryStore: ensureFreeSpace(1965) called with curMem=333975, maxMem=5556991426 16/01/28 20:05:39 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1965.0 B, free 5.2 GB) 16/01/28 20:05:39 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.107.18.34:36737 (size: 1965.0 B, free: 5.2 GB) 16/01/28 20:05:39 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:874 16/01/28 20:05:39 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at mapPartitions at ADAMRDDFunctions.scala:126) 16/01/28 20:05:39 INFO YarnScheduler: Adding task set 0.0 with 2 tasks 16/01/28 20:05:39 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, istb1-l2-b14-09.hadoop.priv, NODE_LOCAL, 1589 bytes) 16/01/28 20:05:39 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, istb1-l2-b14-19.hadoop.priv, NODE_LOCAL, 1590 bytes) 16/01/28 20:05:40 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on istb1-l2-b14-19.hadoop.priv:54486 (size: 1965.0 B, free: 2.1 GB) 16/01/28 20:05:40 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on istb1-l2-b14-09.hadoop.priv:49118 (size: 1965.0 B, free: 2.1 GB) 16/01/28 20:05:41 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on istb1-l2-b14-19.hadoop.priv:54486 (size: 26.5 KB, free: 2.1 GB) 16/01/28 20:05:41 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on istb1-l2-b14-09.hadoop.priv:49118 (size: 26.5 KB, free: 2.1 GB) 16/01/28 20:05:46 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, istb1-l2-b14-19.hadoop.priv): java.lang.NullPointerException at org.bdgenomics.adam.models.SequenceRecord$.fromADAMContig(SequenceDictionary.scala:268) at org.bdgenomics.adam.models.SequenceRecord$.fromSpecificRecord(SequenceDictionary.scala:325) at 
org.bdgenomics.adam.rdd.ADAMSpecificRecordSequenceDictionaryRDDAggregator.getSequenceRecordsFromElement(ADAMRDDFunctions.scala:153) at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator.org$bdgenomics$adam$rdd$ADAMSequenceDictionaryRDDAggregator$$mergeRecords$1(ADAMRDDFunctions.scala:108) at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$2.apply(ADAMRDDFunctions.scala:120) at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$2.apply(ADAMRDDFunctions.scala:120) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157) at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator.org$bdgenomics$adam$rdd$ADAMSequenceDictionaryRDDAggregator$$foldIterator$1(ADAMRDDFunctions.scala:120) at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$3.apply(ADAMRDDFunctions.scala:126) at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$3.apply(ADAMRDDFunctions.scala:126) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)

16/01/28 20:05:46 INFO TaskSetManager: Starting task 1.1 in stage 0.0 (TID 2, istb1-l2-b14-19.hadoop.priv, NODE_LOCAL, 1590 bytes) 16/01/28 20:05:51 INFO TaskSetManager: Lost task 1.1 in stage 0.0 (TID 2) on executor istb1-l2-b14-19.hadoop.priv: java.lang.NullPointerException (null) [duplicate 1] 16/01/28 20:05:51 INFO TaskSetManager: Starting task 1.2 in stage 0.0 (TID 3, istb1-l2-b14-07.hadoop.priv, NODE_LOCAL, 1590 bytes) 16/01/28 20:05:51 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 12732 ms on istb1-l2-b14-09.hadoop.priv (1/2) 16/01/28 20:05:53 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on istb1-l2-b14-07.hadoop.priv:56257 (size: 1965.0 B, free: 2.1 GB) 16/01/28 20:05:53 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on istb1-l2-b14-07.hadoop.priv:56257 (size: 26.5 KB, free: 2.1 GB) 16/01/28 20:06:00 WARN TaskSetManager: Lost task 1.2 in stage 0.0 (TID 3, istb1-l2-b14-07.hadoop.priv): java.lang.NullPointerException at org.bdgenomics.adam.models.SequenceRecord$.fromADAMContig(SequenceDictionary.scala:268) at org.bdgenomics.adam.models.SequenceRecord$.fromSpecificRecord(SequenceDictionary.scala:325) at org.bdgenomics.adam.rdd.ADAMSpecificRecordSequenceDictionaryRDDAggregator.getSequenceRecordsFromElement(ADAMRDDFunctions.scala:153) at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator.org$bdgenomics$adam$rdd$ADAMSequenceDictionaryRDDAggregator$$mergeRecords$1(ADAMRDDFunctions.scala:108) at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$2.apply(ADAMRDDFunctions.scala:120) at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$2.apply(ADAMRDDFunctions.scala:120) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157) at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator.org$bdgenomics$adam$rdd$ADAMSequenceDictionaryRDDAggregator$$foldIterator$1(ADAMRDDFunctions.scala:120) at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$3.apply(ADAMRDDFunctions.scala:126) at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$3.apply(ADAMRDDFunctions.scala:126) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)

16/01/28 20:06:00 INFO TaskSetManager: Starting task 1.3 in stage 0.0 (TID 4, istb1-l2-b14-07.hadoop.priv, NODE_LOCAL, 1590 bytes) 16/01/28 20:06:04 INFO TaskSetManager: Lost task 1.3 in stage 0.0 (TID 4) on executor istb1-l2-b14-07.hadoop.priv: java.lang.NullPointerException (null) [duplicate 1] 16/01/28 20:06:04 ERROR TaskSetManager: Task 1 in stage 0.0 failed 4 times; aborting job 16/01/28 20:06:04 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool 16/01/28 20:06:04 INFO YarnScheduler: Cancelling stage 0 16/01/28 20:06:04 INFO DAGScheduler: ResultStage 0 (reduce at ADAMRDDFunctions.scala:127) failed in 25.492 s 16/01/28 20:06:04 INFO DAGScheduler: Job 0 failed: reduce at ADAMRDDFunctions.scala:127, took 25.594264 s 16/01/28 20:06:04 INFO SparkUI: Stopped Spark web UI at http://10.107.18.34:4041 16/01/28 20:06:04 INFO DAGScheduler: Stopping DAGScheduler 16/01/28 20:06:04 INFO YarnClientSchedulerBackend: Shutting down all executors 16/01/28 20:06:04 INFO YarnClientSchedulerBackend: Interrupting monitor thread 16/01/28 20:06:04 INFO YarnClientSchedulerBackend: Asking each executor to shut down 16/01/28 20:06:04 INFO YarnClientSchedulerBackend: Stopped 16/01/28 20:06:04 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 16/01/28 20:06:04 INFO Utils: path = /tmp/spark-86a917a9-9a0e-4d2e-9b6c-c0492c8a5ddc/blockmgr-b310de96-f340-4640-bdd8-6fa3c5d0e409, already present as root for deletion. 16/01/28 20:06:04 INFO MemoryStore: MemoryStore cleared 16/01/28 20:06:04 INFO BlockManager: BlockManager stopped 16/01/28 20:06:04 INFO BlockManagerMaster: BlockManagerMaster stopped 16/01/28 20:06:04 INFO SparkContext: Successfully stopped SparkContext Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 4, istb1-l2-b14-07.hadoop.priv): java.lang.NullPointerException at org.bdgenomics.adam.models.SequenceRecord$.fromADAMContig(SequenceDictionary.scala:268) at org.bdgenomics.adam.models.SequenceRecord$.fromSpecificRecord(SequenceDictionary.scala:325) at org.bdgenomics.adam.rdd.ADAMSpecificRecordSequenceDictionaryRDDAggregator.getSequenceRecordsFromElement(ADAMRDDFunctions.scala:153) at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator.org$bdgenomics$adam$rdd$ADAMSequenceDictionaryRDDAggregator$$mergeRecords$1(ADAMRDDFunctions.scala:108) at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$2.apply(ADAMRDDFunctions.scala:120) at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$2.apply(ADAMRDDFunctions.scala:120) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157) at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator.org$bdgenomics$adam$rdd$ADAMSequenceDictionaryRDDAggregator$$foldIterator$1(ADAMRDDFunctions.scala:120) at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$3.apply(ADAMRDDFunctions.scala:126) at org.bdgenomics.adam.rdd.ADAMSequenceDictionaryRDDAggregator$$anonfun$3.apply(ADAMRDDFunctions.scala:126) at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745)

Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 16/01/28 20:06:04 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! Jan 28, 2016 8:05:39 PM INFO: org.apache.parquet.hadoop.ParquetInputFormat: Total input paths to process : 2 16/01/28 20:06:04 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 16/01/28 20:06:04 INFO Utils: Shutdown hook called 16/01/28 20:06:04 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 16/01/28 20:06:04 INFO Utils: Deleting directory /tmp/spark-86a917a9-9a0e-4d2e-9b6c-c0492c8a5ddc

Any kind of help is appreciated.

Thanks & Regards, Ankush Reddy

ryan-williams commented 8 years ago

Your NPE is coming from this line; some contig is null. I'm not sure why that would happen, at a glance.
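
One quick diagnostic (just a suggestion, not something Guacamole or ADAM requires): dump the BAM header and check that the @SQ lines carry contig names and lengths, since the sequence dictionary is built from those records:

       # Print the header only; each @SQ line should have SN (contig name) and LN (length)
       samtools view -H /path/to/reads.bam | grep '^@SQ' | head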

ankushreddy commented 8 years ago

Hi @ryan-williams, I have tried different .bam files, but I am facing the same issue. Could you please let me know when Guacamole will be ready to handle both a .bam file and a FASTA reference file?

Thanks & Regards, Ankush Reddy.

arahuja commented 8 years ago

Hi @ankushreddy

Can you tell us more about the error you hit when using a BAM file? As we mentioned earlier, if the ADAM format is different from the one we support, we would expect issues, but that part of the code has seen little use/testing as well. We recently upgraded our ADAM input, if you want to retest.

However, if you are seeing a similar error with a BAM input, that would be good to know about.

ankushreddy commented 8 years ago

Hi @arahuja, I just want to check: what version of ADAM should I use now, or is it enough to use a BAM or SAM file that is aligned with the reference? I will test it once again and let you know the results.

arahuja commented 8 years ago

@ankushreddy We now only support loading the reference explicitly and do not rely on MD tags anymore. Also, we aren't really supporting germline-threshold anymore, if that is what you are using, and it will likely be removed. We have updated the README with new sample commands to try.
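
For orientation only, a hypothetical sketch of what an invocation with an explicitly loaded reference might look like; the caller name and flag names below are assumptions, so please take the exact command from the updated README:

       spark-submit ... --class org.hammerlab.guacamole.Guacamole \
         /local/guacamole/target/guacamole-with-dependencies-0.0.1-SNAPSHOT.jar \
         <caller-from-README> \
         --reads hdfs:///path/to/reads.bam \
         --reference /path/to/reference.fa \
         --out hdfs:///path/to/out.vcf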

ankushreddy commented 8 years ago

@arahuja Thanks for the reply. I am actually testing it on the SAM file, but I see a lot of variants being called. Is there any way we can reduce the number of variants based on quality or anything else?
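
One hedged option in the meantime is to post-filter the emitted VCF by quality with bcftools; the QUAL cutoff of 30 is an arbitrary placeholder, and whether Guacamole populates QUAL in a way that makes this meaningful is an assumption to verify:

       # Keep only records whose QUAL is at least 30; adjust the expression as needed
       bcftools view -i 'QUAL>=30' result.vcf > result.filtered.vcf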