magicDGS opened this issue 7 years ago
To implement reads traversal, we require an InputFormat for FASTQ, but the existing one does not look too flexible. There is a new paper describing a library with InputFormat implementations for HDFS: FASTdoop. This may be useful, but it does not have a repository nor a Maven artifact, so I will write the authors an email to be able to use it. Still, the question of how to process split FASTQ files remains...

There is also an implementation of a FASTQ input format in the ADAM project. Concretely, there are implementations for both single-end files and interleaved pair-end files in org.bdgenomics.adam.io.FastqRecordReader. There are limitations regarding compression in the interleaved format, and maybe we can use a workaround based on the single-end reader to also support split pair-end files.
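As a reference for what a custom InputFormat involves, here is a minimal, hypothetical sketch (not taken from FASTdoop, ADAM or ReadTools) of a Hadoop InputFormat that emits one 4-line FASTQ record per value and simply marks files as non-splittable. Making files non-splittable sidesteps the record-boundary problem at the cost of parallelism within a single file, and the sketch ignores compression and pair-end handling.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

/**
 * Sketch of a FASTQ InputFormat: each value is one complete 4-line record.
 * Files are treated as non-splittable, so a reader never sees a partial record.
 */
public class NonSplittableFastqInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(final JobContext context, final Path file) {
        // give up on splitting: simplest way to keep whole records within one reader
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(final InputSplit split,
            final TaskAttemptContext context) {
        return new NonSplittableFastqRecordReader();
    }

    private static class NonSplittableFastqRecordReader extends RecordReader<LongWritable, Text> {
        private LineReader lines;
        private long pos;
        private long end;
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(final InputSplit genericSplit, final TaskAttemptContext context)
                throws IOException {
            final FileSplit split = (FileSplit) genericSplit;
            final Path path = split.getPath();
            end = split.getStart() + split.getLength();
            // plain (uncompressed) stream; a real implementation would also handle codecs
            lines = new LineReader(path.getFileSystem(context.getConfiguration()).open(path));
            pos = split.getStart();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            key.set(pos);
            final Text line = new Text();
            final StringBuilder record = new StringBuilder();
            // a FASTQ record is exactly 4 lines: name, bases, separator ('+') and qualities
            for (int i = 0; i < 4; i++) {
                final int read = lines.readLine(line);
                if (read == 0) {
                    // end of file (or truncated record): stop the traversal
                    return false;
                }
                pos += read;
                if (i > 0) {
                    record.append('\n');
                }
                record.append(line.toString());
            }
            value.set(record.toString());
            return true;
        }

        @Override
        public LongWritable getCurrentKey() { return key; }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() {
            return end == 0 ? 1.0f : Math.min(1.0f, pos / (float) end);
        }

        @Override
        public void close() throws IOException {
            if (lines != null) {
                lines.close();
            }
        }
    }
}
```

A splittable version would need the extra logic to detect where a record starts inside an arbitrary split, which is essentially what FASTdoop and the ADAM FastqRecordReader provide.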
Issue for collecting all ideas related to this topic.
We can implement a Spark framework for our walkers. For this, we require:

- A ReadToolsWalker extending SparkCommandLineProgram, inspired by the GATKSparkTool implementation (a sketch follows below).
- A pipeline tool chaining trimming -> barcode assignment -> bwa mem mapping. The last step requires the JNI bindings from GATK.

This will simplify our pipelines, and they will run faster if we set up a Spark cluster.

All these ideas will eventually require their own issues.
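To make the walker idea more concrete, here is a hypothetical sketch of a Spark-backed walker base class: the base class wires input, traversal and output through an RDD, and subclasses only provide the per-read logic. The class and method names (AbstractSparkReadWalker, apply, run) are placeholders rather than actual ReadTools or GATK API; a real implementation would extend SparkCommandLineProgram as proposed above and work on proper read objects instead of raw record strings. It reuses the non-splittable FASTQ input format sketched earlier.

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

/**
 * Hypothetical Spark-backed walker: the base class handles input, traversal
 * and output; subclasses only implement the per-read transformation.
 */
public abstract class AbstractSparkReadWalker {

    /** Subclasses transform the reads RDD (e.g. trimming, barcode assignment). */
    protected abstract JavaRDD<String> apply(JavaRDD<String> reads);

    /** Loads FASTQ records, applies the tool logic and writes the result. */
    public void run(final JavaSparkContext ctx, final String input, final String output) {
        // one string per 4-line FASTQ record, via the input format sketched above
        final JavaRDD<String> reads = ctx
                .newAPIHadoopFile(input, NonSplittableFastqInputFormat.class,
                        LongWritable.class, Text.class, ctx.hadoopConfiguration())
                .values()
                .map(Text::toString);
        apply(reads).saveAsTextFile(output);
    }
}
```

A pipeline tool would then just chain its steps inside apply, for example (with placeholder per-record functions standing in for the real trimming and barcode-assignment logic, and leaving out the bwa mem/JNI step):

```java
import org.apache.spark.api.java.JavaRDD;

/** Hypothetical pipeline tool: trimming followed by barcode assignment. */
public class TrimAndAssignBarcodesSpark extends AbstractSparkReadWalker {

    @Override
    protected JavaRDD<String> apply(final JavaRDD<String> reads) {
        return reads
                .map(TrimAndAssignBarcodesSpark::trimRecord)      // hypothetical trimming step
                .map(TrimAndAssignBarcodesSpark::assignBarcode);  // hypothetical barcode step
    }

    // placeholder per-record functions; a real tool would reuse the existing walkers' logic
    private static String trimRecord(final String fastqRecord) { return fastqRecord; }
    private static String assignBarcode(final String fastqRecord) { return fastqRecord; }
}
```

Chaining the steps as RDD transformations keeps the pipeline lazy, so Spark can fuse trimming and barcode assignment into a single pass over the data.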