malariagen / pipelines

Pipelines for processing malaria parasite and mosquito genome sequence data.
MIT License

Added Dockerfile and single-task WDL for baseline amplicon SNP calling parasite pipeline. #40

Closed by samuelklee 4 years ago

samuelklee commented 4 years ago

Very basic port of Stage 1, Step 3 of the amplicon SNP calling parasite pipeline. Adds a Dockerfile (for a conda environment that contains all necessary tools), a single-task WDL, and a test JSON.

Note that the test JSON points to a CRAM and a VCF (containing targets for collecting pileups) that have been "lifted over". That is, the original pipeline aligned reads to a reference containing a separate contig for each panel target, whereas these files are with respect to the full reference.

In putting this together, I discovered other issues in upstream steps that may need to be corrected (in addition to aligning against the full reference) before we can run GATK/Picard-based comparisons against this baseline. For example, Step 2 does not produce a valid CRAM file according to Picard ValidateSamFile. There are also some issues with unmapped/split reads that I'd like to understand.
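For anyone wanting to reproduce the validation check mentioned above, here is a minimal sketch of how a Picard ValidateSamFile invocation could be assembled. The file names and the `picard.jar` path are placeholders, not the files used in this PR; the reference argument is included because Picard needs the reference to decode a CRAM.

```python
import shlex
from typing import List


def validate_sam_file_cmd(
    cram: str,
    reference: str,
    picard_jar: str = "picard.jar",  # placeholder path, adjust to your install
    mode: str = "SUMMARY",
) -> List[str]:
    """Build a Picard ValidateSamFile command line for a CRAM file.

    Uses Picard's legacy KEY=VALUE argument syntax. MODE=SUMMARY prints a
    histogram of error types; MODE=VERBOSE lists individual problems.
    """
    if mode not in ("SUMMARY", "VERBOSE"):
        raise ValueError(f"unsupported MODE: {mode}")
    return [
        "java", "-jar", picard_jar, "ValidateSamFile",
        f"I={cram}",
        f"R={reference}",
        f"MODE={mode}",
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(shlex.join(validate_sam_file_cmd("sample.cram", "reference.fasta")))
```

The function only builds the argument list; passing it to `subprocess.run` would execute the check wherever Java and Picard are available.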

We can fix and reorganize things as we move along; this is just to get us started.

Closes #37.

samuelklee commented 4 years ago

Just noting: after today's discussion, we can probably move forward by just relaxing downstream validation stringency, at least for now. The invalid reads may actually cause some tools to fail, in which case we do need to go back and reexamine demultiplexing/alignment. I will continue investigating the issue with the unmapped/split reads to make sure it is indeed a separate one. The next step will be to run M2 out of the box.

samuelklee commented 4 years ago

Thanks @fleharty! @JonKeatley112 any comments?

fleharty commented 4 years ago

@samuelklee When you run this, do you use the methods Cromwell server?

samuelklee commented 4 years ago

I set up a Terra workspace (broad-firecloud-dsde/malariagen-dev) just to double check that it ran on the cloud successfully. It only takes a few minutes to run either there or locally.
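For the local route, a rough sketch of a single-workflow Cromwell run might look like the following. The WDL file name, the workflow name, and the input keys are hypothetical stand-ins rather than the names used in this PR; the only conventions assumed are Cromwell's `run` subcommand with `--inputs` and its `<workflow>.<input>` key format for the inputs JSON.

```python
import json
import shlex
from pathlib import Path
from typing import Dict, List


def write_inputs_json(inputs: Dict[str, str], path: Path) -> Path:
    """Write a Cromwell inputs file; keys follow '<workflow>.<input>'."""
    path.write_text(json.dumps(inputs, indent=2))
    return path


def cromwell_run_cmd(
    wdl: str,
    inputs_json: str,
    cromwell_jar: str = "cromwell.jar",  # placeholder path to the Cromwell jar
) -> List[str]:
    """Build the command for a one-off local run of a single workflow."""
    return ["java", "-jar", cromwell_jar, "run", wdl, "--inputs", inputs_json]


if __name__ == "__main__":
    # Hypothetical workflow and input names, for illustration only.
    inputs = {
        "ExampleWorkflow.input_cram": "sample.cram",
        "ExampleWorkflow.targets_vcf": "targets.vcf",
    }
    inputs_path = write_inputs_json(inputs, Path("example.inputs.json"))
    print(shlex.join(cromwell_run_cmd("example.wdl", str(inputs_path))))
```

Running in `run` mode avoids standing up a Cromwell server, which fits a task that finishes in a few minutes.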

fleharty commented 4 years ago

@samuelklee Could you provide a link to the Terra workspace? (Oh, never mind, you did...)

Is there a reason this isn't merged yet?

samuelklee commented 4 years ago

Was waiting to see if @JonKeatley112 wanted to comment, but I’ll go ahead and merge.

fleharty commented 4 years ago

@samuelklee Could you share the Terra workspace broad-firecloud-dsde/malariagen-dev with me? I can't seem to access it.