harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License
63 stars · 30 forks

Quality of life improvements #40

Closed · tsackton closed 2 years ago

tsackton commented 2 years ago

Starting an issue to track things I see when running this pipeline on real data that don't rise to the level of actual bugs, but that could be worth fixing just to make things a bit smoother.

So far nothing here seems to prevent the pipeline from continuing (except the genmap_map step, which has not finished and needs a memory increment added). If everyone adds notes here for little things like this, we can start clearing them in the bugfix branch as people have time.

tsackton commented 2 years ago

Commit 24f51ab6c94eb345a7f927df017b01804f9b32ff in bugfix fixes the rule dedup log issue by replacing > with >> in the BuildBamIndex command, and adds incrementing memory to genmap_map. Right now genmap_map still uses the same memory as genmap_index, which is probably overkill (should fix issues 2 and 3).
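The append-vs-truncate distinction behind that fix is easy to demonstrate in isolation (log file name hypothetical):

```shell
# With '>', the BuildBamIndex redirect would truncate the log and wipe
# out the earlier MarkDuplicates entry; '>>' appends instead, so both
# tools' output survives in the same per-sample log.
echo "MarkDuplicates finished" >  dedup.log   # create/truncate the log
echo "BuildBamIndex finished"  >> dedup.log   # append, keeping the prior line
cat dedup.log
```

After this runs, dedup.log contains both lines rather than only the last one.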

tsackton commented 2 years ago

Commit bd3de062ba0c8b8b2a1e1552f4e3dfdf5dc26e7a in bugfix adds bam_sumstats.txt to rule all, fixing issue 5.

tsackton commented 2 years ago

Commits 4005949e6c979f4f1f903c59887db8fe611ed7d4 and d88d98c0c1b2d830d8a3399011f39056cfcde8c8 reorganize the genmap output, fix some bugs in genmap file names, and redirect unneeded fastp output to /dev/null.

cademirch commented 2 years ago
> • [x] 11. The unzipped fastqs are really rough on storage and for decent-sized pipelines quickly get out of hand. There should be a way to prioritize getting one sample through to bam before downloading the next, I think, and this would be a big help on the storage front without impacting throughput.

This should be fixed by marking the output of the download rule as temp, as well as the deduped bam. So everything intermediate from fastq>gvcf.

Can't find it now, but I remember reading Johannes saying that temp files are prioritized, and in practice I have seen this work. I too ran into file-space issues when dealing with thousands of samples: even with the fastqs marked as temp, the bams take up a lot of space. But marking everything between fastq>gvcf (not the gvcf itself) allowed me to run for weeks just fine.

tsackton commented 2 years ago

> This should be fixed by marking the output of the download rule as temp, as well as the deduped bam. So everything intermediate from fastq>gvcf.
>
> Can't find it now, but I remember reading Johannes saying that temp files are prioritized, and in practice I have seen this work. I too ran into file-space issues when dealing with thousands of samples: even with the fastqs marked as temp, the bams take up a lot of space. But marking everything between fastq>gvcf (not the gvcf itself) allowed me to run for weeks just fine.

For sure we should mark the fastq and the fastq.gz files as temp, that will help a lot. I think for now I'd prefer to keep the bams, until we have tested and run the coverage calculation steps and are sure we won't need them for anything else.

6bc7e28553dd5f7d20986067cb29a9d928d2eb27 does this.
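As a sketch, the temp() marking looks roughly like this in a Snakefile (rule and path names here are hypothetical, not snpArcher's actual rules):

```python
rule download_reads:
    output:
        # temp() tells Snakemake to delete these files once every
        # downstream consumer has run, which keeps scratch usage
        # bounded even across thousands of samples
        r1 = temp("results/fastq/{sample}_1.fastq.gz"),
        r2 = temp("results/fastq/{sample}_2.fastq.gz"),
    shell:
        "..."  # download command goes here
```

Everything intermediate between fastq and gvcf would get the same treatment, while the deduped bams and the gvcfs themselves stay unmarked for now.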

cademirch commented 2 years ago

> • [ ] 12. The plink step fails if there are fewer than 2 individuals, but some of the outgroup species we are running are just one individual, and the use case for this pipeline could include other single-individual runs. Need to think about how to do this, exactly.

I should be able to tackle this.

tsackton commented 2 years ago

Running some more: I think that task 1 is actually not an issue; when a Snakemake job runs to completion without error, the empty directories seem to be removed. I marked the fastp output as temp, so it should get cleaned once the collect_fastp_stats rule runs.

I also think point 4 has more to do with the I/O issues on our cluster than anything else, so closing.

And df618f0e7b4850079ac352db4fbb436fbda92c86 closes point 5 (05_vcfs as temp).

tsackton commented 2 years ago

Decided that using a conda prefix other than .snakemake/conda should not be a default option, so adding this to the compPopGen_ms repository instead.

cademirch commented 2 years ago
> • [ ] 12. The plink step fails if there are fewer than 2 individuals, but some of the outgroup species we are running are just one individual, and the use case for this pipeline could include other single-individual runs. Need to think about how to do this, exactly.

Taking a look at this now. For the case of fewer than 2 individuals, it seems simplest to skip all of the qc rules; does that sound reasonable? It would be simple to implement as an input function.

tsackton commented 2 years ago
> • [ ] 12. The plink step fails if there are fewer than 2 individuals, but some of the outgroup species we are running are just one individual, and the use case for this pipeline could include other single-individual runs. Need to think about how to do this, exactly.

> Taking a look at this now. For the case of fewer than 2 individuals, it seems simplest to skip all of the qc rules; does that sound reasonable? It would be simple to implement as an input function.

Agree this is simplest for now. We should check by species, though, as some runs include more than one species, and only one of those species may have a single individual.
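An input function along those lines could be sketched in plain Python (function name and target pattern are hypothetical; the real workflow would read species assignments from its sample sheet):

```python
from collections import Counter

def qc_targets(samples, pattern="results/qc/{species}.done"):
    """Return QC targets only for species with at least 2 individuals.

    `samples` is a list of (sample_id, species) pairs standing in for
    the workflow's sample sheet; species with a single individual are
    skipped so plink never runs on fewer than 2 samples.
    """
    counts = Counter(species for _, species in samples)
    return [pattern.format(species=sp)
            for sp, n in sorted(counts.items()) if n >= 2]

# A run with two finches and a single crow keeps only the finch QC target:
print(qc_targets([("s1", "finch"), ("s2", "finch"), ("s3", "crow")]))
# → ['results/qc/finch.done']
```

Hooked up as a Snakemake input function, rule all would simply receive an empty QC list for single-individual species while the rest of the pipeline runs unchanged.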