moiexpositoalonsolab / grenepipe

A flexible, scalable, and reproducible pipeline to automate variant calling from raw sequence reads, with lots of bells and whistles.
http://grene-net.org
GNU General Public License v3.0

merging calls from multiple pipeline runs? #39

Closed RvV1979 closed 10 months ago

RvV1979 commented 10 months ago

Hi Lucas,

My colleagues, students, and I are avid users of the grenepipe pipeline, as it offers great flexibility. By now, we have several directories from which the pipeline was run, and we would like to run a larger, combined analysis. Is there a way to merge, e.g., the called directories from different runs and perform the final (genotyping and filtering) rules on the combined set?

Thanks

lczech commented 10 months ago

Hi @RvV1979,

thanks for the nice feedback, good to hear that the pipeline is useful!

As of now, that type of functionality is not built into grenepipe. I've had several users ask similar questions, each with slightly different requirements. Implementing a mechanism for that within grenepipe would be rather cumbersome, both for me and for users: it would need to be flexible and powerful enough to allow starting the pipeline somewhere in the middle in all these different ways, and hence would require some complicated way of specifying exactly which files to take from where at which step.

That being said, I guess that "take the called files and continue from there" could be a case that multiple people would find useful. Not sure if I'll get to implement that as a special case... it's not planned for now, but I'll put it on my list.

Anyway, since grenepipe currently does not offer this intrinsically, there is still another way, albeit a rather complicated one: you could try to trick Snakemake into doing what you want. You'd need to put all your samples into one big samples table, and provide the called files. Then, start Snakemake in a dry run with --reason, and investigate which files it wants in each step. You might be able to get Snakemake to think that all files needed for the subsequent steps (genotyping, filtering, etc.) are there, and that it can hence start from there. Be aware that Snakemake uses the time stamps of files to determine which ones need to be re-computed (because their inputs have changed), so you'd definitely need to learn some Snakemake internals to pull this off.
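To make the time stamp point a bit more concrete, here is a minimal Python sketch of the basic idea only: an output counts as outdated if it is missing or older than one of its inputs. This is not Snakemake's actual logic (which may take further criteria into account), and the function name `stale_outputs` as well as the pairing of outputs with inputs are hypothetical, made up for illustration.

```python
from pathlib import Path


def stale_outputs(pairs):
    """Return the outputs that are missing or older than any existing input.

    `pairs` is an iterable of (output_path, [input_paths]) tuples.
    Simplified illustration of a time stamp check, not Snakemake's own code.
    """
    stale = []
    for output, inputs in pairs:
        out = Path(output)
        if not out.exists():
            # A missing output always needs to be (re-)computed.
            stale.append(str(output))
            continue
        out_mtime = out.stat().st_mtime
        # An output is outdated if at least one input was modified after it.
        if any(
            Path(i).exists() and Path(i).stat().st_mtime > out_mtime
            for i in inputs
        ):
            stale.append(str(output))
    return stale
```

Running something like this over the called files and their expected downstream outputs could give a rough idea of which files a dry run would probably want to recreate, before comparing that against what the dry run actually reports.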

Honestly, I'm not sure that I would even want to bother with that - unless computational resources and compute time on your cluster are the limiting factor on which you really need to save. Then it might be worth tinkering around with that. Otherwise, the waaaaay simpler approach is to just create a big samples table with all your fastq files, and run the whole thing in one large run. Up to you!
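For the simple approach, the samples tables of the individual runs need to be combined into one. Below is a minimal Python sketch of that, under some assumptions: the tables are tab-separated, share the same header, and contain columns named `sample` and `unit` (please check against your own tables). The function name `merge_samples_tables` is hypothetical and not part of grenepipe. The fastq paths are copied as they are, so they should be absolute if the tables come from different directories.

```python
import csv


def merge_samples_tables(table_paths, out_path):
    """Concatenate tab-separated samples tables into a single table.

    Assumes all tables share the same header and have `sample` and `unit`
    columns (an assumption about the table layout). Raises ValueError on a
    header mismatch or on a duplicated (sample, unit) pair.
    Returns the number of rows written.
    """
    if not table_paths:
        raise ValueError("no samples tables given")
    header = None
    rows = []
    seen = set()
    for path in table_paths:
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            if header is None:
                header = reader.fieldnames
            elif reader.fieldnames != header:
                raise ValueError(f"header of {path} differs from the first table")
            for row in reader:
                key = (row["sample"], row["unit"])
                if key in seen:
                    # The same sample/unit in two runs needs manual attention.
                    raise ValueError(f"duplicate sample/unit {key} in {path}")
                seen.add(key)
                rows.append(row)
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=header, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Usage would be along the lines of `merge_samples_tables(["run1/samples.tsv", "run2/samples.tsv"], "combined/samples.tsv")`, with these paths being placeholders, and then pointing the config of the new run to the combined table.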

Let me know what you think!

Cheers and so long
Lucas

RvV1979 commented 10 months ago

Hi Lucas,

Based on your description, I would indeed not want to bother with tricking Snakemake into doing something it was not designed to do. In any case, your advice has saved me a lot of time trying to figure out the near impossible, so many thanks for your quick reply.