moiexpositoalonsolab / grenepipe

A flexible, scalable, and reproducible pipeline to automate variant calling from raw sequence reads, with lots of bells and whistles.
http://grene-net.org
GNU General Public License v3.0

Call variants being rerun when adding new samples #18

Closed Brent-Saylor-Canopy closed 2 years ago

Brent-Saylor-Canopy commented 2 years ago

Hi,

I've been running your pipeline to add 4-8 new samples at a time to a previous analysis as new data comes in. For the most part it works well, but I'm finding that the call_variants rule is being rerun for every new sample, even though I'm only adding a couple of samples. Is there a way to keep these from re-running? It's adding a lot of extra computation time.

Here is an example dry-run job list from trying to use HaplotypeCaller on 39 samples, 8 of which are new and 2 of which are additional files for the same sample.

Job stats:
job                              count    min threads    max threads
-----------------------------  -------  -------------  -------------
all                                  1              1              1
bam_index                            8              1              1
call_variants                      407              1              1
combine_calls                       11              1              1
fastqc                              16              1              1
genotype_variants                   11              1              1
hard_filter_calls                    2              1              1
map_reads                            8              1              1
mark_duplicates                      8              1              1
merge_calls                          1              1              1
merge_variants                       1              1              1
multiqc                              1              1              1
picard_collectmultiplemetrics        8              1              1
qualimap                             8              1              1
samtools_flagstat                    8              1              1
samtools_stats                       8              1              1
select_calls                         2              1              1
trim_reads_pe                        8              1              1
vcf_index_gatk                     407              1              1
total                              924              1              1

lczech commented 2 years ago

Hi @Brent-Saylor-Canopy,

you already closed the issue before I could answer - does that mean that you have solved it already? If not, feel free to re-open as needed!

In short, from what you posted, it's hard to guess why this is happening. The snakemake command line interface, however, offers an option for this, which is aptly called --reason. If you add this option together with the dry-run option (-n), you should get a (very long...) listing that tells you, for each job to be executed, why it is being invoked. You should then be able to figure out which updates snakemake thinks are necessary, and why your variant calls need to be re-run.
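
Since that listing can run to hundreds of jobs, it may help to tally the reasons per rule instead of reading them one by one. Below is a minimal Python sketch that assumes the dry-run output (from `snakemake -n --reason`) contains `rule <name>:` headers followed by a `reason:` line for each job; the exact layout can differ between snakemake versions, and the function name `tally_reasons` as well as all file paths in the sample text are made up for illustration.

```python
import re
from collections import Counter


def tally_reasons(dry_run_text):
    """Count (rule, reason category) pairs in snakemake dry-run output.

    Assumes each job block starts with a 'rule <name>:' header and
    contains a 'reason: ...' line; this layout is an assumption.
    """
    counts = Counter()
    current_rule = None
    for line in dry_run_text.splitlines():
        stripped = line.strip()
        header = re.match(r"(?:local)?(?:rule|checkpoint)\s+(\w+):", stripped)
        if header:
            current_rule = header.group(1)
        elif stripped.startswith("reason:") and current_rule is not None:
            reason = stripped[len("reason:"):].strip()
            # Keep only the category in front of the file list,
            # e.g. "Updated input files".
            category = reason.split(":", 1)[0].strip()
            counts[(current_rule, category)] += 1
    return counts


# Hypothetical excerpt of a dry-run listing, for illustration only.
sample = """\
rule call_variants:
    input: mapped/A.bam, ref/genome.fa
    output: called/A.chr1.g.vcf.gz
    jobid: 3
    reason: Updated input files: ref/genome.fa
rule call_variants:
    input: mapped/B.bam, ref/genome.fa
    output: called/B.chr1.g.vcf.gz
    jobid: 4
    reason: Updated input files: ref/genome.fa
rule map_reads:
    output: mapped/N.bam
    jobid: 5
    reason: Missing output files: mapped/N.bam
"""

counts = tally_reasons(sample)
for (rule, category), n in sorted(counts.items()):
    print(f"{rule:20s} {category:25s} {n}")
```

If most call_variants jobs report the same updated input file, that file is the likely culprit.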

My gut feeling is that some timestamps of input files got updated somehow or something stupid like that, and that hence snakemake thinks those files are changed and so their downstream rules need updating... If that's the case, you can fix this by manually changing the timestamps. But let's see what's going on there first ;-)
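
To illustrate the timestamp idea, here is a small Python sketch. It uses a deliberately simplified model in which an output counts as stale whenever any input has a newer modification time; snakemake's actual rerun logic takes more into account. The helper names `is_stale` and `backdate` and the file names are hypothetical, and the files are created in a temporary directory so nothing real is touched.

```python
import os
import tempfile
import time


def is_stale(output, inputs):
    """Simplified check: is any input newer than the output?"""
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(path) > out_mtime for path in inputs)


def backdate(path, reference):
    """Set the timestamps of path to one second before those of reference."""
    t = os.path.getmtime(reference) - 1
    os.utime(path, (t, t))


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names standing in for a reference index and a call.
    ref_index = os.path.join(tmp, "genome.fa.fai")
    vcf = os.path.join(tmp, "sample.g.vcf.gz")
    for path in (ref_index, vcf):
        open(path, "w").close()

    now = time.time()
    os.utime(vcf, (now - 100, now - 100))  # output of an earlier run
    os.utime(ref_index, (now, now))        # input that was re-created later

    stale_before = is_stale(vcf, [ref_index])  # input is newer than output
    backdate(ref_index, vcf)                   # manually reset the timestamp
    stale_after = is_stale(vcf, [ref_index])   # input is now older

    print(stale_before, stale_after)
```

Resetting timestamps like this is only safe if the content of the input files really is unchanged. Snakemake's own --touch option, which marks existing output files as up to date without re-running their commands, may be an alternative worth checking.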

So long,
Lucas

Brent-Saylor-Canopy commented 2 years ago

Hi Lucas, you called it. The references had been re-indexed as part of another pipeline, and snakemake wanted to re-call everything that had been produced before that.

lczech commented 2 years ago

Haha yes, happened to me before as well, no worries. Glad you figured it out ;-)