moiexpositoalonsolab / grenepipe

A flexible, scalable, and reproducible pipeline to automate variant calling from raw sequence reads, with lots of bells and whistles.
http://grene-net.org
GNU General Public License v3.0

Updating run with new reads #14

Closed. Brent-Saylor-Canopy closed this issue 3 years ago.

Brent-Saylor-Canopy commented 3 years ago

Hi, I have found your pipeline very helpful and would like to know the best way to use it when new samples are added regularly.

I have been running it separately for each set of reads I get, but is there a way to make it run only the new samples added to the samples list?

When I try to add new samples to the samples file and launch the pipeline again I get a message saying there is nothing to be done.

I feel like I'm missing something obvious, as this seems like something snakemake is already good at.

Brent-Saylor-Canopy commented 3 years ago

Nevermind

lczech commented 3 years ago

Hey @Brent-Saylor-Canopy,

did it work in the end? Usually, this should work - as you said, snakemake is good at this. But I never explicitly tested this, so I'd be curious to hear your feedback!

Cheers and so long, Lucas

Brent-Saylor-Canopy commented 3 years ago

It does work. I had copied an input file that was older than the output file initially, but that was fixed with a touch command.
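To illustrate the timestamp issue (a minimal sketch with made-up file names, not grenepipe's actual layout): snakemake re-runs a rule only if an input is newer than its output, so an input file copied with an old mtime looks "up to date", and a plain touch fixes that.

```shell
#!/usr/bin/env bash
set -euo pipefail
mkdir -p demo/samples demo/called

# Simulate the situation: the copied input carries an old timestamp,
# while an output from a previous run is newer.
touch -t 202001010000 demo/samples/SampleA_R1.fastq.gz
touch -t 202101010000 demo/called/SampleA.g.vcf.gz

# Snakemake compares mtimes, so this looks up to date ("nothing to be done").
# Updating the input's mtime makes it newer than the output again:
touch demo/samples/SampleA_R1.fastq.gz

if [ demo/samples/SampleA_R1.fastq.gz -nt demo/called/SampleA.g.vcf.gz ]; then
    echo "input newer than output: snakemake would re-run the rule"
fi
```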

The only error that comes up is that filtered/all.vcf.gz is write-protected, so it needs to be deleted before it can be regenerated.

lczech commented 3 years ago

Ah thanks, the touch makes sense :-)

As for the write-protected file: I think it is good to keep it that way, in order to avoid accidental overwriting; it means that users need to make sure that they actually want to regenerate it.

Brent-Saylor-Canopy commented 3 years ago

Yes, the touch is easy enough. The other option would be to add a step that creates links to each of the read files and updates them on each run. That way the timestamp on the files corresponds to when the analysis was run rather than when the file was created.
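A minimal sketch of that symlink idea (file names are made up): re-creating a link at analysis time gives it a fresh mtime, independent of the raw file's age.

```shell
#!/usr/bin/env bash
set -euo pipefail
mkdir -p reads run_links

# A raw read file with an old timestamp (e.g., copied from elsewhere):
touch -t 202001010000 reads/SampleA_R1.fastq.gz

# Re-create the symlink at the start of each analysis run;
# the link itself gets the current time as its mtime:
ln -sfn ../reads/SampleA_R1.fastq.gz run_links/SampleA_R1.fastq.gz

# The link's own mtime (stat without -L) is now newer than the target's:
link_mtime=$(stat -c %Y run_links/SampleA_R1.fastq.gz)
target_mtime=$(stat -Lc %Y run_links/SampleA_R1.fastq.gz)
if [ "$link_mtime" -gt "$target_mtime" ]; then
    echo "link timestamp reflects the run, not the raw file"
fi
```

GNU stat is assumed here, and whether snakemake considers the link's or the target's timestamp may depend on its version, so treat this only as a sketch of the idea.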

Yes, the file protection makes sense. I'm not sure how many people will have a similar use case to mine.

lczech commented 3 years ago

Hm, interesting idea to use symlinks. It might also solve some file naming issues with downstream tools. I am not entirely sure though that it would not also introduce new issues - I'll have to think about this. But thanks for the suggestion!

Also, here is another way to solve this: https://snakemake.readthedocs.io/en/stable/project_info/faq.html#snakemake-does-not-trigger-re-runs-if-i-add-additional-input-files-what-can-i-do (the "snakemake" way).

Brent-Saylor-Canopy commented 3 years ago

I've encountered another problem with trying a larger scale test of adding new samples and rerunning the pipeline.

I keep getting an error when the call_variants rule is launched:

```
ProtectedOutputException in line 38 of /mnt/Data1/GBS_data/grenepipe/rules/calling-haplotypecaller.smk:
Write-protected output files for rule call_variants:
called/Sample23.10A.g.vcf.gz
```

This seems to happen when the rule is launched. I'm not sure why, but the pipeline seems to be re-calling the variants for every sample, not just the new ones.

lczech commented 3 years ago

Not sure that I understand your question here.

Is the issue that it fails with the exception about write-protected files? That is intentional: I've marked these files as write-protected in order to avoid accidental re-computation. Hence, in cases where you do want to compute them again (which does not seem to be what you want here...), you have to delete them manually first; this is meant as a protection from mistakes that could otherwise lead to expensive re-computation.

If your question however is why these files are being re-computed in the first place: As you noted before, snakemake works by comparing the timestamps of input and output files, and re-runs downstream rules if their input is newer than their output. I cannot tell from the information you provided what exact chain of updates leads snakemake to want to do this, but you can call the pipeline with the -n --reason flags: a dry run (-n) that reports the reason for each rule execution. It might be that your input sample files were somehow updated, or that some intermediate files changed.

Let me know if this helped and if you have further questions :-)

Brent-Saylor-Canopy commented 3 years ago

My question is why they are being recomputed at all. It is running call_variants on both the 100 samples that were run previously, and the 20 samples I added. I would expect that variants would only need to get called for the new 20 samples. The only thing I changed was to add new samples and change the "known-variants" setting in the config file. I'll have to test it out on another run. I removed the write protection on the files for now so I could get the updated results.

Would changing the config parameter "known-variants" cause each sample to get call_variants run on them again?
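For reference, removing the write protection by hand is just a chmod (a sketch with dummy files mirroring the names from the error above; deleting the files would also work):

```shell
#!/usr/bin/env bash
set -euo pipefail
mkdir -p called filtered

# Recreate the protected-output situation with dummy files:
touch called/Sample23.10A.g.vcf.gz filtered/all.vcf.gz
chmod a-w called/Sample23.10A.g.vcf.gz filtered/all.vcf.gz

# Make them writable again so snakemake can regenerate them:
chmod u+w called/*.g.vcf.gz filtered/all.vcf.gz
```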

lczech commented 3 years ago

Ah yes, that is the reason then! The variant calling takes these known variants into account and hence produces different output, so it needs to be repeated. You can check with the --reason flag as well, as there might be additional reasons, but this is definitely one of them!

Brent-Saylor-Canopy commented 3 years ago

Ahh, Ok. That setting won't change for me, so hopefully I'll only need to rerun the variant calling this one time.

Thanks for your help!

lczech commented 2 years ago

For anyone finding this in the future: In recent versions, I have removed the file write-protection, because users were confused by it. This comes at the risk of unnecessary computation, but that can easily be checked beforehand by running snakemake with -n for a dry run, to verify that the rule executions are as expected.