sanger-pathogens / snp-sites

Finds SNP sites from a multi-FASTA alignment file
http://sanger-pathogens.github.io/snp-sites/
Other
232 stars 50 forks source link

Missing data in vcf output format #64

Open MafaldaSFerreira opened 6 years ago

MafaldaSFerreira commented 6 years ago

Hello,

I was using your tool to obtain the SNPs in a fasta alignment and using vcf as an output format. I noticed that in cases where my reference is some nucleotide (A, C, G or T), samples that have missing data (N), will have a 0 - becoming REF genotype - and won't be coded as missing data anymore.

For example:

CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | BID_1 | CHS_2 | LAM_1 | LAM_2 | LAM_3 |

1 | 3032469 | . | C | T | . | . | . | GT | 0 | 0 | 0 | 0 | 1 |

When this position in the alignment is:

BID_1 C CHS_2 C LAM_1 N LAM_2 N LAM_3 T

I was wondering if this is the normal behavior? Or should I code missing data in another format (? or -), so that missing data will be properly noticed by the tool?

Thanks, Mafalda

ViriatoII commented 4 years ago

I've very interested in knowing the answer to this. No one helps?

BiologicalScientist commented 3 years ago

I have also noticed this. I assume it is due to the fact that N is the ambiguity code for any base so there isn't strictly a variant in that location. It would be nice to have the option to call strictly anything that differs from the reference even if there is ambiguity or missing data. Not sure if the other ambiguity codes are similarly called the same

As a workaround for this I believe you could change all Ns in your fasta file to something that doesn't have a IUPAC meaning like a J or means missing data which is typically . or - That won't be confused for another base or amino acid (should you be using those). Just make sure to change it back in your output. Worth noting for an amino acid case you might find there to be more ambiguity codes that are an issue (though I haven't tried to see if Y and R etc also cause issues.)