boasvdp closed this issue 2 years ago
Here is a program that does almost what you want: dnaDist. It not only gives you the SNPs per sequence length; even better, it gives you the Jukes-Cantor-corrected substitution rate, which is what you need for a proper evolutionary distance. However, it does not do complete deletion, i.e. remove a column if any of the sequences has a gap there. Here is a program of mine that does that, but for alignments in MAF format: maf2dist. Finally, here is a program that supports complete deletion and gives proper evolutionary distances even with unaligned sequences: phylonium.
</shamelessselfplug>
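For readers unfamiliar with the two ideas mentioned above, here is a minimal Python sketch of complete deletion followed by a Jukes-Cantor correction. It is not the implementation used by dnaDist, maf2dist, or phylonium; the function names `complete_deletion` and `jukes_cantor` are hypothetical, and it assumes the sequences are already aligned to equal length and that anything other than A, C, G, or T counts as a gap.

```python
import math


def complete_deletion(seqs):
    """Drop every alignment column in which any sequence lacks an A/C/G/T.

    Assumption: all sequences are aligned and have the same length.
    """
    valid = set("ACGT")
    cols = [
        col
        for col in zip(*(s.upper() for s in seqs))
        if all(c in valid for c in col)
    ]
    if not cols:
        return ["" for _ in seqs]
    # Transpose the surviving columns back into rows (one per sequence).
    return ["".join(row) for row in zip(*cols)]


def jukes_cantor(a, b):
    """Jukes-Cantor distance d = -3/4 * ln(1 - 4p/3).

    p is the proportion of mismatching sites between two gap-free,
    equal-length sequences.
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    p = sum(x != y for x, y in zip(a, b)) / len(a)
    if p >= 0.75:
        # The correction is undefined once p reaches saturation.
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


if __name__ == "__main__":
    # Made-up toy alignment, for illustration only.
    aln = ["ACGTACGT-A", "ACGTACGTAA", "ACCTACGTAA"]
    trimmed = complete_deletion(aln)
    print(trimmed)
    print(jukes_cantor(trimmed[0], trimmed[2]))
```

Because complete deletion removes a column for all sequences at once, every pairwise distance is computed over the same set of sites, which keeps the distances comparable across pairs.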
Thanks for the suggestions, I'll be sure to check them out!
@boasvdp a large deletion is usually a single biological event, and the "cost"/"distance" is usually downweighted in some way rather than counting N times for N gaps.
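One simple way to express that idea is to count each contiguous run of gaps as a single indel event rather than one event per gapped column. The sketch below is only an illustration of that counting scheme, not how any of the tools mentioned in this thread weight gaps; the function name `count_events` is hypothetical, and it assumes two aligned, equal-length, upper-case sequences with `-` as the gap character.

```python
def count_events(a, b, gap="-"):
    """Count substitutions and indel events between two aligned sequences.

    A maximal run of columns where the same sequence carries a gap is
    counted as ONE indel event. Columns where both sequences have a gap
    are ignored and do not interrupt a run.
    """
    subs = 0
    indels = 0
    state = None  # "a" or "b" while a gap run is open in that sequence
    for x, y in zip(a, b):
        if x == gap and y == gap:
            continue
        if x == gap or y == gap:
            side = "a" if x == gap else "b"
            if side != state:
                # A new gap run starts here: one event for the whole run.
                indels += 1
                state = side
        else:
            state = None
            if x != y:
                subs += 1
    return subs, indels


if __name__ == "__main__":
    # Made-up example: one 4-column deletion plus one substitution.
    print(count_events("ACGT----ACGT", "ACGTACGTACCT"))
```

How much a single indel event should "cost" relative to a substitution is a separate modelling choice that this sketch leaves open.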
Yes, I agree that simply outputting SNPs per aligned Mbp does not solve the issue completely, especially when other tools exist that seem to take care of this. This issue can be closed if you both think this would not be enough of an enhancement!
In some cases it might be useful to have an idea how much of the genome is considered for SNP differences. For example, take this alignment of 13 positions:
Using snp-dists, this would show 2 SNP differences between all isolates. However, the aligned length is vastly different between isolates. For a set of E. coli I'm working on now, I check the alignment lengths by extracting pairs with seqtk subseq, and the alignments range from 500 kbp to 4,000 kbp. Expressing differences between isolates as SNPs per Mbp of alignment instead of just the SNPs has provided better resolution for identifying similarity between isolates (in our experience). Would you think this is a useful feature for snp-dists?
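To make the proposal concrete, here is a minimal Python sketch of the calculation for one pair of aligned sequences. This is not something snp-dists outputs; the function name `snps_per_mbp` is hypothetical, the example pair is made up, and the sketch assumes that only columns where both sequences have A, C, G, or T count as aligned.

```python
def snps_per_mbp(a, b):
    """Return (snps, aligned_length, snps_per_mbp) for two aligned sequences.

    Assumption: a column is "aligned" only if both sequences have an
    unambiguous base (A, C, G, or T) there; gaps and N are skipped.
    """
    valid = set("ACGT")
    aligned = 0
    snps = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in valid and y in valid:
            aligned += 1
            if x != y:
                snps += 1
    # Avoid dividing by zero when the pair shares no aligned columns.
    rate = snps / aligned * 1e6 if aligned else float("nan")
    return snps, aligned, rate


if __name__ == "__main__":
    # Made-up 13-column pair; the second half of seq_a is missing.
    seq_a = "ACGTACGTAC---"
    seq_b = "ACCTACGAACGTA"
    snps, aligned, rate = snps_per_mbp(seq_a, seq_b)
    print(f"{snps} SNPs over {aligned} aligned positions = {rate:.0f} SNPs/Mbp")
```

Two pairs with the same raw SNP count can end up with very different rates once the aligned length is taken into account, which is the point of the request above.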