harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License
63 stars · 30 forks

Quality of life improvements #40

Closed · tsackton closed 2 years ago

tsackton commented 2 years ago

Starting an issue to track things I see when running this pipeline on real data that don't rise to the level of actual bugs, but that could be worth fixing just to make things a bit smoother.

So far nothing here seems to prevent the pipeline from continuing (except the genmap_map step, which has not finished and needs a memory increment added). If everyone adds notes here for little things like this, we can start clearing them in the bugfix branch as people have time.

tsackton commented 2 years ago

Commit 24f51ab6c94eb345a7f927df017b01804f9b32ff in bugfix fixes the rule dedup log issue by replacing > with >> in the BuildBamIndex command, and adds incrementing memory to genmap_map. Right now genmap_map still uses the same memory as genmap_index, which is probably overkill (should fix issues 2 and 3).
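The append-vs-truncate distinction behind that fix is easy to demonstrate in isolation (log file name hypothetical):

```shell
# With '>', the BuildBamIndex redirect would truncate the log and wipe
# out the earlier MarkDuplicates entry; '>>' appends instead, so both
# tools' output survives in the same per-sample log.
echo "MarkDuplicates finished" >  dedup.log   # create/truncate the log
echo "BuildBamIndex finished"  >> dedup.log   # append, keeping the prior line
cat dedup.log
```

After this runs, dedup.log contains both lines rather than only the last one.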

tsackton commented 2 years ago

Commit bd3de062ba0c8b8b2a1e1552f4e3dfdf5dc26e7a in bugfix adds bam_sumstats.txt to rule all, fixing issue 5.

tsackton commented 2 years ago

Commits 4005949e6c979f4f1f903c59887db8fe611ed7d4 and d88d98c0c1b2d830d8a3399011f39056cfcde8c8 reorganize the genmap output, fix some bugs in genmap file names, and redirect unneeded fastp output to /dev/null.

cademirch commented 2 years ago
> • [x] 11. The unzipped fastqs are really rough on storage and for decent-sized pipelines quickly get out of hand. There should be a way to prioritize getting one sample through to bam before downloading the next, I think, and this would be a big help on the storage front without impacting throughput.

This should be fixed by marking the output of the download rule as temp, as well as the deduped bam. So everything intermediate from fastq>gvcf.

Can't find it now, but I remember reading Johannes saying that temp files are prioritized, and in practice I have seen this work. I too ran into file-space issues when dealing with thousands of samples: even with the fastqs marked as temp, the bams take up a lot of space. But marking everything between fastq>gvcf (not the gvcf itself) allowed me to run for weeks just fine.

tsackton commented 2 years ago

> This should be fixed by marking the output of the download rule as temp, as well as the deduped bam. So everything intermediate from fastq>gvcf.
>
> Can't find it now, but I remember reading Johannes saying that temp files are prioritized, and in practice I have seen this work. I too ran into file-space issues when dealing with thousands of samples: even with the fastqs marked as temp, the bams take up a lot of space. But marking everything between fastq>gvcf (not the gvcf itself) allowed me to run for weeks just fine.

For sure we should mark the fastq and the fastq.gz files as temp, that will help a lot. I think for now I'd prefer to keep the bams, until we have tested and run the coverage calculation steps and are sure we won't need them for anything else.

6bc7e28553dd5f7d20986067cb29a9d928d2eb27 does this.
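As a sketch, the temp() marking looks roughly like this in a Snakefile (rule and path names here are hypothetical, not snpArcher's actual rules):

```python
rule download_reads:
    output:
        # temp() tells Snakemake to delete these files once every
        # downstream consumer has run, which keeps scratch usage
        # bounded even across thousands of samples
        r1 = temp("results/fastq/{sample}_1.fastq.gz"),
        r2 = temp("results/fastq/{sample}_2.fastq.gz"),
    shell:
        "..."  # download command goes here
```

Everything intermediate between fastq and gvcf would get the same treatment, while the deduped bams and the gvcfs themselves stay unmarked for now.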

cademirch commented 2 years ago

> • [ ] 12. The plink step fails if there are fewer than 2 individuals, but some of the outgroup species we are running are just one individual, and the use case for this pipeline could include other single-individual runs. Need to think about how to do this, exactly.

I should be able to tackle this.

tsackton commented 2 years ago

Running some more: I think that task 1 is actually not an issue; when a Snakemake job runs to completion without error, the empty directories seem to be removed. I marked the fastp output as temp, so it should get cleaned once the collect_fastp_stats rule runs.

I also think point 4 has more to do with the I/O issues on our cluster than anything else, so closing.

And df618f0e7b4850079ac352db4fbb436fbda92c86 closes point 5 (05_vcfs as temp).

tsackton commented 2 years ago

Decided that using a conda prefix other than .snakemake/conda should not be a default option, so adding this to the compPopGen_ms repository instead.

cademirch commented 2 years ago
> • [ ] 12. The plink step fails if there are fewer than 2 individuals, but some of the outgroup species we are running are just one individual, and the use case for this pipeline could include other single-individual runs. Need to think about how to do this, exactly.

Taking a look at this now. For the case of fewer than 2 individuals, it seems simplest to skip all of the qc rules; does that sound reasonable? It would be simple to implement as an input function.

tsackton commented 2 years ago
> • [ ] 12. The plink step fails if there are fewer than 2 individuals, but some of the outgroup species we are running are just one individual, and the use case for this pipeline could include other single-individual runs. Need to think about how to do this, exactly.

> Taking a look at this now. For the case of fewer than 2 individuals, it seems simplest to skip all of the qc rules; does that sound reasonable? It would be simple to implement as an input function.

Agree this is simplest for now. We should check by species, though, as some runs include more than one species, and only one of those species may have a single individual.
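An input function along those lines could be sketched in plain Python (function name and target pattern are hypothetical; the real workflow would read species assignments from its sample sheet):

```python
from collections import Counter

def qc_targets(samples, pattern="results/qc/{species}.done"):
    """Return QC targets only for species with at least 2 individuals.

    `samples` is a list of (sample_id, species) pairs standing in for
    the workflow's sample sheet; species with a single individual are
    skipped so plink never runs on fewer than 2 samples.
    """
    counts = Counter(species for _, species in samples)
    return [pattern.format(species=sp)
            for sp, n in sorted(counts.items()) if n >= 2]

# A run with two finches and a single crow keeps only the finch QC target:
print(qc_targets([("s1", "finch"), ("s2", "finch"), ("s3", "crow")]))
# → ['results/qc/finch.done']
```

Hooked up as a Snakemake input function, rule all would simply receive an empty QC list for single-individual species while the rest of the pipeline runs unchanged.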