sanger-pathogens / snp-sites

Finds SNP sites from a multi-FASTA alignment file
http://sanger-pathogens.github.io/snp-sites/

Improvement: let user specify pure and ambiguous bases #103

Open arturotorreso opened 3 years ago

arturotorreso commented 3 years ago

This is an amazing tool, and I ended up relying quite a lot on it due to its speed!

One improvement I would add is letting the user specify what counts as a "pure" base and what counts as an "unknown" base. This feature is inspired by two situations I run into often: 1) Often "-" actually represents a genuine polymorphism, and for non-phylogenetic analyses users may want to keep those positions in their SNP alignment. 2) I often use IUPAC ambiguity codes in my alignments (M, R, W...), and with this option the column would be kept at positions that contain the reference base plus an IUPAC code.

I think the change would be relatively easy to implement. I changed the source code ("is_unknown" and "is_pure" in alignment-file.c) before compiling so that it suits my needs, but other users might benefit from this as well.
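To make the request concrete, here is a minimal sketch of what user-configurable base classes could look like. It is not the snp-sites code: `base_classes`, `base_classes_init`, `is_pure_base`, `is_unknown_base` and `column_is_variable` are all hypothetical names, and the rule "a column is variable if it holds at least two distinct pure characters" is my simplifying assumption about how such a check might work.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: none of these names come from snp-sites. */
typedef struct {
    bool pure[256];    /* characters counted as real alleles */
    bool unknown[256]; /* characters treated as missing data */
} base_classes;

/* Reset `table`, then mark every character of `chars` in it,
 * in both upper and lower case. */
void set_class(bool table[256], const char *chars)
{
    assert(chars != NULL);
    memset(table, 0, 256 * sizeof(bool));
    for (const char *p = chars; *p != '\0'; p++) {
        table[(unsigned char)toupper((unsigned char)*p)] = true;
        table[(unsigned char)tolower((unsigned char)*p)] = true;
    }
}

/* Build both lookup tables from user-supplied strings,
 * e.g. values that could come from command-line options. */
void base_classes_init(base_classes *bc, const char *pure, const char *unknown)
{
    assert(bc != NULL);
    set_class(bc->pure, pure);
    set_class(bc->unknown, unknown);
}

bool is_pure_base(const base_classes *bc, char c)
{
    return bc->pure[(unsigned char)c];
}

bool is_unknown_base(const base_classes *bc, char c)
{
    return bc->unknown[(unsigned char)c];
}

/* Assumed rule: a column is variable if it contains at least two
 * distinct pure characters, compared case-insensitively.
 * Characters that are not pure are skipped. */
bool column_is_variable(const base_classes *bc, const char *column, size_t n)
{
    int first = -1;
    for (size_t i = 0; i < n; i++) {
        if (!is_pure_base(bc, column[i]))
            continue;
        int c = toupper((unsigned char)column[i]);
        if (first < 0)
            first = c;
        else if (c != first)
            return true;
    }
    return false;
}
```

With pure set to "ACGT" and unknown set to "N-?", a column such as AA-A is not reported as variable under this rule. Initialising with pure "ACGT-MRWSYKVHDB" and unknown "N?" makes the same column variable, and a column such as AARA as well, which covers both situations described above.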