Too many jobs created if reference genome has a large number of contigs

J-Wall commented 3 years ago

Hi, I am facing the following problem:

I am using grenepipe to do variant calling on a few dozen samples (around 80, whole-genome re-sequenced) of a non-model organism. The reference genome assembly I am working with has a large number (>4000) small fragmented contigs. Consequently, when the DAG has to be re-built after the samtools_faidx checkpoint, snakemake really struggles, as a few hundred-thousand jobs are created. This causes "Updating job merge_variants." to take a very long time, and causes problems for slurm queues etc. which may impose limits on the number of jobs submitted concurrently.

In my situation, I could reduce the reference genome to just the core set of chromosome-level contigs we have, although I don't really want to do this, as there may be important variation in the un-scaffolded contigs. Other non-model genome assemblies might only have small contigs...

My suggested solution is to add an option to pool contigs (ie. pass multiple contigs to regions input of call_variants) into fewer jobs. This is related to the existing restrict-regions option (and perhaps has to be mutually exclusive, but I am not sure). I am experimenting with something along these lines (https://github.com/J-Wall/grenepipe, currently only using haplotypecaller), although my solution is a little bit hacky at the moment. If it's a feature you would be interested in adding, I might be interested to contribute.

Once again, thanks for the excellent pipeline!

lczech commented 3 years ago

Hi @J-Wall,

that exact same issue recently came up with a collaborator as well, and so I have already thought about adding it. So, great idea!

Back then, I decided to not implement it immediately, as I had many other things on my plate, and the collaborator was working with only 400 contigs, so it was still manageable. But your case is even more extreme, and so yes, I can see the usefulness of this!

I'd be happy to work on this at some point, and sure, if you already have an implementation that does something in that direction, let's continue working on this, or maybe you can even prepare a PR for this if you want :-)

Cheers Lucas

lczech commented 3 years ago

Hi @J-Wall,

finally got to work on this. In the latest set of commits, I've added the functionality that you proposed to grenepipe.

My first draft was based on your idea, thank you very much for this! I did not have a pull request from you to work off, but mentioned you in the commit. I hope that this works as recognition :-)

Then, I started refining your solution. It now uses a simple (greedy) bin packing solver to achieve a slightly more optimal distribution of contigs per group, allows to use the other calling tools as well, and has generally a "less hacky" snakemake setup. Maybe that is also interesting for you - you can back-port it to your fork, or just use the main repo instead, as this should solve your problem as well (and has some more bug fixes by now that are not part of your fork).

Cheers, and thanks again for both the suggestion and your initial solution! Lucas

PS: Closing this now. Feel free to re-open should there be any issues with the new code.

moiexpositoalonsolab / grenepipe

Too many jobs created if reference genome has a large number of contigs #8