harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License
63 stars 30 forks

Postprocessing update #199

erikenbody closed 1 week ago

erikenbody commented 1 week ago

I'm addressing two issues here. The first is that our MAF filter isn't set correctly in the strict filter: we don't remove SNPs at the high-frequency end (1 - params.maf). I think this is a necessary change. The second is identifying sites retained as SNPs whose alleles are longer than one base and removing them.
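To illustrate the first point, here is a minimal sketch of a symmetric MAF check. It is not the workflow's actual filter expression. The function name `passes_maf` and the inclusive boundaries are assumptions made for illustration; only the idea of filtering at both `maf` and `1 - maf` comes from the description above.

```python
def passes_maf(alt_af: float, maf: float) -> bool:
    """Return True if a site passes a symmetric minor-allele-frequency filter.

    alt_af is the ALT allele frequency at the site and maf is the threshold
    (the role played by params.maf). Taking min(alt_af, 1 - alt_af) gives the
    minor allele frequency, so sites are dropped at both ends of the range.
    Whether the boundary itself is kept is an assumption here.
    """
    minor_af = min(alt_af, 1.0 - alt_af)
    return minor_af >= maf


if __name__ == "__main__":
    threshold = 0.01  # hypothetical value for params.maf
    for af in (0.005, 0.5, 0.995):
        print(af, passes_maf(af, threshold))
```

A one-sided check such as `alt_af >= maf` would keep the site at 0.995, which is the behavior described as incorrect above.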

Here is an example of a multi-base site that is currently retained by the filters:

JALCYL010000001.1       775994  .       AC      CC      11928.3 .

The original raw vcf call is as follows:

JALCYL010000001.1       775994  .       AC      CC,A    11928.28        .

So in postprocessing we remove the deletion allele, but the remaining REF and ALT alleles keep their two-base representation. We could probably recode these as SNPs, but given that SNPs overlapping indels are more likely to be errors, it seems sensible to remove them altogether. As it stands, this formatting affects some downstream analyses that extract nucleotides at genotype positions.
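For illustration, the check for such sites can be sketched in a few lines of Python. This is not the code in this PR. The function name `is_single_base_site` is hypothetical, and the sketch assumes whitespace-delimited records with the standard VCF column order (CHROM, POS, ID, REF, ALT, ...). It does not treat symbolic or missing ALT alleles specially.

```python
def is_single_base_site(vcf_line: str) -> bool:
    """Return True if REF and every ALT allele are exactly one base long.

    Assumes a data line (not a '#' header) with the standard VCF column
    order. Fields are split on any whitespace, which works for tab-delimited
    files and for the space-padded examples shown in this thread.
    """
    fields = vcf_line.split()
    ref = fields[3]
    alts = fields[4].split(",")
    return len(ref) == 1 and all(len(alt) == 1 for alt in alts)


if __name__ == "__main__":
    # The retained site from above: REF and ALT are both two bases long.
    retained = "JALCYL010000001.1       775994  .       AC      CC      11928.3 ."
    # A hypothetical ordinary SNP record for comparison.
    ordinary = "chr1\t100\t.\tA\tG\t50\t."
    print(is_single_base_site(retained))
    print(is_single_base_site(ordinary))
```

Checking both REF and ALT catches records like the one above, where dropping the deletion allele leaves a two-base REF/ALT pair that differs at only one position.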

I am interested in input on the second change, especially if anyone has a more clever way of doing it. The current approach is reasonably fast, anyway.