harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License
63 stars 30 forks source link

High cov options #1

Closed sjswuitchik closed 3 years ago

sjswuitchik commented 3 years ago

Updated order of channels in env build for bam2vcf_gatk, updated filter expression and output compression in bam2vcf_gatk rule gatherVcfs, added high coverage options in config.yaml under 'parameters you may need to change' and updated bam2gvcf rule in bam2vcf_gatk accordingly

tsackton commented 3 years ago

The high depth and low depth options looks good.

For variant filtration, should we consider just having a separate filter for each metric? I'm not sure there is good logic for why Read Position Rank Sum and Quality by Depth are in the same filter.

Also, I think it would make sense to add a QUAL < 30 filter, since by default VCFs are not filtered on QUAL by GATK.

Could you update these filters and then test on a few (small?) VCFs? You might also test the --invalidate-previous-filters option, as I believe it removes previous filter commands to give us clean vcf output.

References: https://gatk.broadinstitute.org/hc/en-us/articles/360037499012?id=3225 https://gatk.broadinstitute.org/hc/en-us/articles/360035531112 https://gatk.broadinstitute.org/hc/en-us/articles/360035890471

sjswuitchik commented 3 years ago

Filters have been separated and updated, will test on data

sjswuitchik commented 3 years ago

I've tested the high coverage options (toggled with commenting in config), the new separated filter expressions, added a new rule to compress and index the final VCF, as well as a QUAL filter.

tsackton commented 3 years ago

A few comments:

1) I think it should be possible for gatherVcfs to output compressed files, so that you don't need a separate compress rule, which is inefficient.

2) The QUAL < 30 vcftools filter will remove sites that have QUAL < 30, instead of tagging them with a filter. We don't want any sites to be removed from the final vcf, so this should be written as a GATK filter not a vcftools removal line.

3) It would be more efficient to filter, then gather, instead of gather, then filter. That way, the filter can be done per-interval and run more efficiently across many nodes.

4) The name of the final output should be more informative, something like {speciesName}_final.vcf.gz instead of Combined_hardfilters which will be the same for all species and has the potential to produce conflicts.

sjswuitchik commented 3 years ago

Rules have been updated for bam2vcf (both GATK & FB) to filter then gather, with a compressed and indexed output. QUAL_filter has been added and the final output VCF filename contains a species code/name defined in the config.