Support running benchmarks on mutect + strelka

hammerlab / variant-calling-benchmarks

Automated and curated variant calling benchmarks for Guacamole

Apache License 2.0


Support running benchmarks on mutect + strelka #28

Open timodonnell opened 8 years ago

timodonnell commented 8 years ago

For the Guacamole paper we'll need to compare Guacamole against mutect, strelka, and other callers if possible (e.g. mutect2) using the benchmarks in this repo. This could also potentially be a basis for the PICI effort to get reference datasets and analyses for somatic variant calling in place.

Probably we should have a command like vcb-standard-callers that launches a ketrew workflow for these callers, similarly to the existing commands which launch Guacamole. Can also have a separate command (vcb-mutect) for each caller if that is easier. Ideally it will be possible to run on demeter or on google cloud. Hopefully can use essentially what we already have in epidisco.
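A rough sketch of what the command-line surface of such a command might look like. Only the name vcb-standard-callers and the caller names come from the discussion above; every flag name, the backend choices, and the default caller list are assumptions for illustration, and nothing here launches a ketrew workflow.

```python
import argparse

# Caller names mentioned in this issue; the set actually supported is an assumption.
KNOWN_CALLERS = ("mutect", "strelka", "mutect2")


def build_parser():
    # All flag names below are hypothetical, not existing vcb options.
    parser = argparse.ArgumentParser(prog="vcb-standard-callers")
    parser.add_argument("--normal-bam", required=True, help="path to the normal BAM")
    parser.add_argument("--tumor-bam", required=True, help="path to the tumor BAM")
    parser.add_argument("--out-dir", required=True, help="directory for output VCFs")
    parser.add_argument(
        "--caller",
        action="append",
        choices=KNOWN_CALLERS,
        dest="callers",
        help="caller to run; may be given more than once",
    )
    parser.add_argument(
        "--backend",
        choices=("demeter", "gcloud"),
        default="demeter",
        help="where the workflow would be submitted",
    )
    return parser


def parse_args(argv=None):
    args = build_parser().parse_args(argv)
    if not args.callers:
        # Assumed default: the two callers named in the issue title.
        args.callers = ["mutect", "strelka"]
    return args
```

A per-caller command such as vcb-mutect could then be a thin wrapper that calls the same parser with a fixed caller list.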

@smondet @ihodes @armish or anyone else interested in tackling this?

smondet commented 8 years ago

@timodonnell I will add more (optional) variant callers to the future epidisco. We can also add Guacamole, and results could show up on the same webpage, but dealing with the infrastructure-setup part is a bit harder, so it may take a while.

timodonnell commented 8 years ago

Cool, it would be fine if the infrastructure setup part is done later. If we could just run these benchmarks through epidisco on demeter (or perhaps on an already configured google cloud infrastructure), that would be great.

timodonnell commented 7 years ago

Thinking a little more about how we might get this done soon. I can add the command here that will call epidisco, and it's ok if the pipeline runs on the local machine and is not parallelized. What I would just need is a script that sets up the ketrew environment and then launches epidisco using the specified BAMs, writing VCF(s) to the given output paths. If I need to write an epidisco input JSON file with a given format pointing to the files, that's fine too. But we would have to use the actual BAMs passed, not realign. And the ketrew setup would ideally be short-lived and in a script, not a long-running thing we set up manually. How possible / reasonable does that seem?
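To make the JSON option concrete, here is a minimal sketch of how the command on this side might write such an input file. The field names (normal_bam, tumor_bam, realign, output_vcfs) are invented for illustration; the format epidisco would actually accept is not specified in this thread.

```python
import json
import os


def write_caller_input(path, normal_bam, tumor_bam, output_vcfs):
    """Write a hypothetical input spec pointing at existing BAMs.

    output_vcfs maps a caller name to the path where its VCF should be written.
    The schema is an assumption, not a documented epidisco format.
    """
    spec = {
        "normal_bam": os.path.abspath(normal_bam),
        "tumor_bam": os.path.abspath(tumor_bam),
        # Use the BAMs as passed; do not realign.
        "realign": False,
        "output_vcfs": {
            caller: os.path.abspath(vcf) for caller, vcf in output_vcfs.items()
        },
    }
    with open(path, "w") as handle:
        json.dump(spec, handle, indent=2, sort_keys=True)
    return spec
```

For example, write_caller_input("input.json", "normal.bam", "tumor.bam", {"mutect": "out/mutect.vcf"}) would produce a file that a launcher script could read before starting the pipeline.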

smondet commented 7 years ago

I think what you want there is a fresh biokepi pipeline, which is easy to do (I think we already had one somewhere that runs all variant callers on a pair of BAMs).

Running a Ketrew docker locally is also easy.

Then I don't think the tools themselves can run on a single machine unless the data is very small. Mutect2, even parallelized, can take about 20 hours on a single chromosome, and its memory use grows quite a lot. Also, by default a Ketrew pipeline will maximize parallelization among tools, so to leave all the memory available for each tool, we would need to hack fake dependencies among tools to force them to run sequentially. If it happens to run successfully, the whole thing would take week(s).
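The fake-dependency idea can be illustrated in plain Python. This is only a sketch of the concept: it is not Ketrew's API (Ketrew workflows are written in OCaml), and the function names are made up. Each tool is made to depend on the previous one, so a scheduler that respects dependencies has no choice but to run them one at a time.

```python
def chain_sequentially(tools):
    """Return {tool: [dependencies]} where each tool depends on the one before it."""
    deps = {}
    previous = None
    for tool in tools:
        deps[tool] = [] if previous is None else [previous]
        previous = tool
    return deps


def run_order(deps):
    """Return an execution order that satisfies the dependencies (depth-first)."""
    order = []
    done = set()

    def visit(tool):
        if tool in done:
            return
        for dependency in deps[tool]:
            visit(dependency)
        done.add(tool)
        order.append(tool)

    for tool in deps:
        visit(tool)
    return order
```

With independent tools a scheduler is free to start them all at once; after chaining, the only valid order is the list order, which leaves the machine's memory to one tool at a time at the cost of total wall-clock time.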

timodonnell commented 7 years ago

I see, thanks for the info @smondet. Sounds like running on demeter or google cloud is the way to go. Would I be launching the workflow by running a docker container and passing in the paths to the BAMs? That works for me. Even if we can only run mutect at first, that would be a great start.