magicDGS opened this issue 7 years ago
To implement reads traversal, we require an InputFormat for FASTQ, but the existing one does not look too flexible. There is a new paper describing a library with InputFormat implementations for HDFS: FASTdoop. This may be useful, but it does not have a repository nor a Maven artifact, so I will write the authors an email to be able to use it. Still, the question of how to process split FASTQ files remains...

There is also an implementation of a FASTQ input format in the ADAM project. Concretely, there are implementations for both single-end files and interleaved pair-end files in org.bdgenomics.adam.io.FastqRecordReader. There are limitations regarding compression in the interleaved format, and maybe we can use a workaround based on the single-end reader to also support split pair-end files.
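As a reference for what a custom InputFormat involves, here is a minimal, hypothetical sketch (not taken from FASTdoop, ADAM or ReadTools) of a Hadoop InputFormat that emits one 4-line FASTQ record per value and simply marks files as non-splittable. Making files non-splittable sidesteps the record-boundary problem at the cost of parallelism within a single file, and the sketch ignores compression and pair-end handling.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

/**
 * Sketch of a FASTQ InputFormat: each value is one complete 4-line record.
 * Files are treated as non-splittable, so a reader never sees a partial record.
 */
public class NonSplittableFastqInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(final JobContext context, final Path file) {
        // give up on splitting: simplest way to keep whole records within one reader
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(final InputSplit split,
            final TaskAttemptContext context) {
        return new NonSplittableFastqRecordReader();
    }

    private static class NonSplittableFastqRecordReader extends RecordReader<LongWritable, Text> {
        private LineReader lines;
        private long pos;
        private long end;
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(final InputSplit genericSplit, final TaskAttemptContext context)
                throws IOException {
            final FileSplit split = (FileSplit) genericSplit;
            final Path path = split.getPath();
            end = split.getStart() + split.getLength();
            // plain (uncompressed) stream; a real implementation would also handle codecs
            lines = new LineReader(path.getFileSystem(context.getConfiguration()).open(path));
            pos = split.getStart();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            key.set(pos);
            final Text line = new Text();
            final StringBuilder record = new StringBuilder();
            // a FASTQ record is exactly 4 lines: name, bases, separator ('+') and qualities
            for (int i = 0; i < 4; i++) {
                final int read = lines.readLine(line);
                if (read == 0) {
                    // end of file (or truncated record): stop the traversal
                    return false;
                }
                pos += read;
                if (i > 0) {
                    record.append('\n');
                }
                record.append(line.toString());
            }
            value.set(record.toString());
            return true;
        }

        @Override
        public LongWritable getCurrentKey() { return key; }

        @Override
        public Text getCurrentValue() { return value; }

        @Override
        public float getProgress() {
            return end == 0 ? 1.0f : Math.min(1.0f, pos / (float) end);
        }

        @Override
        public void close() throws IOException {
            if (lines != null) {
                lines.close();
            }
        }
    }
}
```

A splittable version would need the extra logic to detect where a record starts inside an arbitrary split, which is essentially what FASTdoop and the ADAM FastqRecordReader provide.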
Issue for collecting all ideas related to this topic.
We can implement a Spark framework for our walkers. For this, we require:

- A ReadToolsWalker extending SparkCommandLineProgram, inspired by the GATKSparkTool implementation (a sketch follows below).
- A pipeline tool chaining trimming -> barcode assignment -> bwa mem mapping. The last step requires the JNI bindings from GATK.

This will simplify our pipelines, and they will run faster if we set up a Spark cluster.

All these ideas will eventually require their own issues.
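To make the walker idea more concrete, here is a hypothetical sketch of a Spark-backed walker base class: the base class wires input, traversal and output through an RDD, and subclasses only provide the per-read logic. The class and method names (AbstractSparkReadWalker, apply, run) are placeholders rather than actual ReadTools or GATK API; a real implementation would extend SparkCommandLineProgram as proposed above and work on proper read objects instead of raw record strings. It reuses the non-splittable FASTQ input format sketched earlier.

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

/**
 * Hypothetical Spark-backed walker: the base class handles input, traversal
 * and output; subclasses only implement the per-read transformation.
 */
public abstract class AbstractSparkReadWalker {

    /** Subclasses transform the reads RDD (e.g. trimming, barcode assignment). */
    protected abstract JavaRDD<String> apply(JavaRDD<String> reads);

    /** Loads FASTQ records, applies the tool logic and writes the result. */
    public void run(final JavaSparkContext ctx, final String input, final String output) {
        // one string per 4-line FASTQ record, via the input format sketched above
        final JavaRDD<String> reads = ctx
                .newAPIHadoopFile(input, NonSplittableFastqInputFormat.class,
                        LongWritable.class, Text.class, ctx.hadoopConfiguration())
                .values()
                .map(Text::toString);
        apply(reads).saveAsTextFile(output);
    }
}
```

A pipeline tool would then just chain its steps inside apply, for example (with placeholder per-record functions standing in for the real trimming and barcode-assignment logic, and leaving out the bwa mem/JNI step):

```java
import org.apache.spark.api.java.JavaRDD;

/** Hypothetical pipeline tool: trimming followed by barcode assignment. */
public class TrimAndAssignBarcodesSpark extends AbstractSparkReadWalker {

    @Override
    protected JavaRDD<String> apply(final JavaRDD<String> reads) {
        return reads
                .map(TrimAndAssignBarcodesSpark::trimRecord)      // hypothetical trimming step
                .map(TrimAndAssignBarcodesSpark::assignBarcode);  // hypothetical barcode step
    }

    // placeholder per-record functions; a real tool would reuse the existing walkers' logic
    private static String trimRecord(final String fastqRecord) { return fastqRecord; }
    private static String assignBarcode(final String fastqRecord) { return fastqRecord; }
}
```

Chaining the steps as RDD transformations keeps the pipeline lazy, so Spark can fuse trimming and barcode assignment into a single pass over the data.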