harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License

add info about starting from bams #139

Open cademirch opened 7 months ago

cademirch commented 7 months ago

Adding some information about starting from BAMs and, later, gVCFs.

However, this currently won't work out of the box, as the BAM summary stats need the fastp summary files.

The bandaid would be to comment out https://github.com/harvardinformatics/snpArcher/blob/6c8951b6b22d3cdcbe15292745dcac742e0d34fd/workflow/rules/common.smk#L29, but this would likely break the QC stuff.

Probably best to add a flag for starting with BAMs that excludes the fastp summary from the BAM sumstats. We could think about a flag for gVCFs too. I'd appreciate thoughts here, @tsackton and @erikenbody.
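To make the flag idea concrete, here is a minimal sketch in plain Python of how the inputs to the BAM summary-stats step could be chosen. The config key `start_from_bams`, the function name, and the file paths are all hypothetical, not names taken from snpArcher; in the workflow itself this logic would presumably live in a Snakemake input function.

```python
from typing import Dict, List


def bam_sumstats_inputs(config: Dict, sample: str) -> List[str]:
    """Return the files a BAM summary-stats step would need for one sample.

    All paths and the 'start_from_bams' key are illustrative assumptions.
    """
    # The BAM itself is always required (hypothetical path).
    inputs = [f"results/bams/{sample}.bam"]

    # Only request the fastp summary when the run started from FASTQs,
    # since no fastp output exists when the user supplies BAMs directly.
    if not config.get("start_from_bams", False):
        inputs.append(f"results/fastp/{sample}.fastp.out")

    return inputs
```

With the flag unset, the function still requests the fastp summary, so runs that start from FASTQs would be unaffected.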

tsackton commented 7 months ago

It looks like we use fastp just for reads passing filters? This seems probably not necessary; perhaps a better solution is to just remove the fastp dependence from sumstats.

cademirch commented 7 months ago

> It looks like we use fastp just for reads passing filters? This seems probably not necessary; perhaps a better solution is to just remove the fastp dependence from sumstats.

Yeah, we do. I agree that the fastp stats should be independent of the BAM summary stats. Though that raises the question of what we do with the fastp stats. I guess they can be left there for downstream analysis by the user if they are interested.

tsackton commented 7 months ago

I don't think that we really need to analyze these stats. They may be useful for someone to look at if they, for example, see weird things in the mapping rates or something, but I've never actually looked at them....
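For a user who does want to glance at these stats when mapping rates look odd, a rough sketch of pulling the fraction of reads passing filters out of a fastp JSON report might look like the following. It assumes the report has `summary.before_filtering.total_reads` and `summary.after_filtering.total_reads` fields; the function name is hypothetical, and whether the workflow keeps the JSON report around is also an assumption.

```python
import json


def fraction_passing_filter(fastp_json_text: str) -> float:
    """Fraction of reads that survived fastp filtering, from a JSON report string.

    Assumes the report exposes total read counts before and after filtering
    under the 'summary' key.
    """
    report = json.loads(fastp_json_text)
    before = report["summary"]["before_filtering"]["total_reads"]
    after = report["summary"]["after_filtering"]["total_reads"]

    # Guard against an empty input file to avoid dividing by zero.
    if before == 0:
        return 0.0
    return after / before
```

A low fraction for a single sample would be one thing to check before digging into the alignments themselves.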