tseemann / snp-dists

Pairwise SNP distance matrix from a FASTA sequence alignment
GNU General Public License v3.0

Add option to report SNPs/aligned mbp between pairs instead of only SNPs? #36

Closed boasvdp closed 2 years ago

boasvdp commented 4 years ago

In some cases it might be useful to have an idea how much of the genome is considered for SNP differences. For example, take this alignment of 13 positions:

>strain1
ACGTGTCAGTACG
>strain2
ACCA---------
>strain3
ACGTGTCAGTATA

Using snp-dists, this would show 2 SNP differences between every pair of isolates. However, the aligned length differs vastly between pairs: strain2 shares only 4 aligned positions with either of the other strains, while strain1 and strain3 are aligned over all 13. For a set of E. coli I'm working on now, I checked the pairwise alignment lengths by extracting pairs with seqtk subseq, and they range from 500 kbp to 4,000 kbp. Expressing differences between isolates as SNPs/mbp of alignment instead of just SNPs has given us better resolution for identifying similarity between isolates (in our experience).
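To make the proposal concrete, here is a minimal Python sketch of the requested metric, run on the 13-column example above. It is not how snp-dists is implemented. It assumes a column counts as aligned only when both sequences carry A, C, G or T, and the helper names (read_fasta, pair_stats) are made up for illustration.

```python
from itertools import combinations

VALID = set("ACGT")  # assumption: only unambiguous bases count as aligned


def read_fasta(text):
    """Parse FASTA text into a dict of name -> uppercase sequence."""
    seqs = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line.upper()
    return seqs


def pair_stats(a, b):
    """Return (snps, aligned_columns) for two aligned sequences."""
    snps = aligned = 0
    for x, y in zip(a, b):
        if x in VALID and y in VALID:
            aligned += 1
            if x != y:
                snps += 1
    return snps, aligned


ALIGNMENT = """>strain1
ACGTGTCAGTACG
>strain2
ACCA---------
>strain3
ACGTGTCAGTATA
"""

seqs = read_fasta(ALIGNMENT)
results = {}
for n1, n2 in combinations(seqs, 2):
    snps, aligned = pair_stats(seqs[n1], seqs[n2])
    # SNPs per million aligned columns; undefined when nothing is aligned.
    per_mbp = snps / aligned * 1e6 if aligned else float("nan")
    results[(n1, n2)] = (snps, aligned, per_mbp)
    print(f"{n1}\t{n2}\t{snps}\t{aligned}\t{per_mbp:.0f}")
```

Every pair has 2 SNPs, but the pairs involving strain2 have only 4 aligned columns against 13 for strain1 and strain3, so the normalised values differ by more than a factor of three.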

Would you think this is a useful feature for snp-dists?

kloetzl commented 4 years ago

Here is a program that does almost what you want: dnaDist. Not only does it give you the SNPs per sequence length; even better, it gives you the Jukes-Cantor-corrected substitution rate, which is what you need for a proper evolutionary distance. However, it does not do complete deletion, i.e. remove a column if any of the sequences has a gap there. Here is a program of mine which does that, but for alignments in maf format: maf2dist. Finally, here is a program which supports complete deletion and gives proper evolutionary distances even with unaligned sequences: phylonium.

</shamelessselfplug>
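For readers unfamiliar with the two ideas mentioned here, the sketch below shows the textbook Jukes-Cantor correction, d = -3/4 ln(1 - 4p/3), and the difference between pairwise and complete deletion of gapped columns. It is an illustration under stated assumptions, not the code of dnaDist, maf2dist or phylonium; all function names are hypothetical.

```python
import math

VALID = set("ACGT")  # assumption: anything else (gaps, N) is treated as missing


def jukes_cantor(p):
    """Jukes-Cantor distance for a proportion p of differing sites."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def pairwise_columns(a, b):
    """Pairwise deletion: keep columns valid in these two sequences."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x in VALID and y in VALID]


def complete_columns(seqs):
    """Complete deletion: keep columns valid in every sequence."""
    return [i for i, col in enumerate(zip(*seqs)) if all(c in VALID for c in col)]


def p_distance(a, b, columns):
    """Proportion of the given columns at which a and b differ."""
    if not columns:
        return float("nan")
    return sum(1 for i in columns if a[i] != b[i]) / len(columns)


s1 = "ACGTGTCAGTACG"
s2 = "ACCA---------"
s3 = "ACGTGTCAGTATA"

# Pairwise deletion uses all 13 columns shared by strain1 and strain3.
p_pair = p_distance(s1, s3, pairwise_columns(s1, s3))
# Complete deletion keeps only the 4 columns present in all three strains.
p_complete = p_distance(s1, s3, complete_columns([s1, s2, s3]))

print(p_pair, jukes_cantor(p_pair))
print(p_complete, jukes_cantor(p_complete))
```

With pairwise deletion, strain1 and strain3 differ at 2 of 13 columns, giving a corrected distance of about 0.17. With complete deletion only the first 4 columns survive, and the two strains are identical there.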

boasvdp commented 4 years ago

Thanks for the suggestions, I'll be sure to check them out!

tseemann commented 4 years ago

@boasvdp a large deletion is usually a single biological event, and the "cost"/"distance" is usually downweighted in some way rather than counting N times for N gaps.
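One simple way to picture this, offered only as a sketch and not as something snp-dists does: count each contiguous run of gapped columns between two sequences as one event instead of counting every gapped column. The function name gap_events is hypothetical, and the sketch deliberately ignores cases such as a run in which the gap switches from one sequence to the other.

```python
def gap_events(a, b, gap="-"):
    """Return (events, gapped_columns) for two aligned sequences.

    A column is gapped when exactly one of the two sequences has a gap.
    Consecutive gapped columns are merged into a single event.
    """
    events = 0
    columns = 0
    in_run = False
    for x, y in zip(a, b):
        one_sided = (x == gap) != (y == gap)
        if one_sided:
            columns += 1
            if not in_run:
                events += 1  # a new run of gapped columns starts here
        in_run = one_sided
    return events, columns


events, columns = gap_events("ACGTGTCAGTACG", "ACCA---------")
print(events, columns)
```

For strain1 and strain2 from the example above, the 9 gapped columns collapse into a single deletion event.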

boasvdp commented 4 years ago

Yes, I agree that simply outputting SNPs/aligned mbp does not solve the issue completely, especially when other tools exist that seem to take care of this. This issue can be closed if you both think this would not be enough of an enhancement!