mbhall88 closed this issue 2 years ago
It sounds to me like you don't want a distance matrix but just a distance list. You don't want to compare all n sequences in one set with all n in the other but just the "same" from the other set. I think this can be achieved with basic scripting. Somewhat along the following lines:
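# Split each multi-FASTA into one file per sequence, named after the sequence ID
mkdir -p foo bar
awk -v RS='>' 'NR>1{f="foo/" $1 ".fa"; printf ">%s", $0 > f; close(f)}' foo.fasta
awk -v RS='>' 'NR>1{f="bar/" $1 ".fa"; printf ">%s", $0 > f; close(f)}' bar.fasta
# Compare each sequence with its same-named counterpart from the other set
for f in foo/*.fa
do
  id=$(basename "$f")
  cat "foo/$id" "bar/$id" > tmp.fa
  # The last matrix row is the second sequence; its first value is the distance to the first
  snp-dists tmp.fa | tail -n 1 | cut -f 1,2
done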
No, I actually want the pairwise distances in a matrix - i.e. I do want to compare all n sequences in one file with all n sequences in the other file.
As you mention, I am doing this currently with some scripting, but thought I would raise this issue as a feature request as it would be nice to have a fast, simple process for this and it seems to fit the scope of snp-dists, which is fast and simple :smiley:
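For anyone landing here, the all-against-all comparison between two files can be sketched in plain awk. This is only an illustration under stated assumptions, not how snp-dists works internally: `cross_dists` is a hypothetical helper name, both inputs are assumed to be aligned FASTA files, and only mismatches where both bases are A, C, G or T are counted (which I assume matches the snp-dists default).

```shell
# Hypothetical helper (not part of snp-dists): print a TSV matrix with the
# sequences of the first alignment as rows and those of the second as columns.
cross_dists() {
  awk '
    # Header line: start a new sequence. FNR == NR holds only for the first file.
    /^>/ {
      id = substr($1, 2)
      if (FNR == NR) { rows[++n] = id } else { cols[++m] = id }
      next
    }
    # Sequence line: strip whitespace and append to the current sequence.
    {
      gsub(/[ \t\r]/, "")
      if (FNR == NR) { a[n] = a[n] toupper($0) } else { b[m] = b[m] toupper($0) }
    }
    END {
      # Header row: a placeholder corner cell, then the column IDs.
      printf "."
      for (j = 1; j <= m; j++) printf "\t%s", cols[j]
      printf "\n"
      for (i = 1; i <= n; i++) {
        printf "%s", rows[i]
        len = length(a[i])
        for (j = 1; j <= m; j++) {
          d = 0
          for (k = 1; k <= len; k++) {
            x = substr(a[i], k, 1)
            y = substr(b[j], k, 1)
            # Assumption: only A/C/G/T versus A/C/G/T mismatches count as SNPs.
            if (x != y && x ~ /^[ACGT]$/ && y ~ /^[ACGT]$/) d++
          }
          printf "\t%d", d
        }
        printf "\n"
      }
    }
  ' "$1" "$2"
}
```

Called as `cross_dists foo.fasta bar.fasta`, this gives a matrix that need not be symmetric and whose diagonal need not be 0. It is quadratic in the number of sequences and far slower than a compiled tool, so it is only meant to pin down the expected output.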
If all you want is the equivalent of snp-dists <(cat $*), that should be easy enough to implement.
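A rough sketch of how the concatenation idea could be used today: run snp-dists on both files together, then keep only the block of the matrix that compares the first file (rows) with the second (columns). `cross_block` is a hypothetical helper, and it assumes the output is a tab-separated matrix with one header row and one label column, with sequences in input order.

```shell
# Hypothetical helper: from a 2n x 2n TSV matrix on stdin (plus header row and
# label column), keep rows 1..n and columns n+1..2n. Takes n as its argument.
cross_block() {
  awk -F '\t' -v OFS='\t' -v n="$1" '
    # Header row plus the first n data rows.
    NR <= n + 1 {
      line = $1
      # Column 1 is the label, so the second half starts at column n + 2.
      for (j = n + 2; j <= NF; j++) line = line OFS $j
      print line
    }
  '
}

# Usage sketch (bash), assuming both files hold the same number of sequences:
#   n=$(grep -c '^>' foo.fasta)
#   snp-dists <(cat foo.fasta bar.fasta) | cross_block "$n"
```

Selecting by position rather than by name matters here because the two files share header IDs, so the labels alone cannot tell the two sets apart. The full 2n x 2n matrix is still computed, which is roughly four times the work actually needed.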
I guess the issue with concatenating the two files is
I implemented this in https://github.com/mbhall88/psdm
How easy/hard would it be to allow pairwise distance between two files?
For example, say I have two ways of producing consensus sequences and I want to compare the distance between these two methods. So, in this example, the distance matrix might not be symmetrical and the diagonal might not be 0.
There would obviously be the requirement that there is the same number of sequences, of the same length and with the same header ID.
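Those requirements can be checked up front. A minimal sketch, assuming plain FASTA input: `seq_table` and `same_layout` are hypothetical helper names, and the check is slightly stricter than stated above because it also expects the sequences in the same order in both files.

```shell
# Hypothetical helper: print "ID<TAB>length" for each sequence, in file order.
seq_table() {
  awk '
    /^>/ {
      if (id != "") print id "\t" len
      id = substr($1, 2)
      len = 0
      next
    }
    { gsub(/[ \t\r]/, ""); len += length($0) }
    END { if (id != "") print id "\t" len }
  ' "$1"
}

# Succeeds only if both files list the same IDs, in the same order,
# with the same sequence length for each ID.
same_layout() {
  [ "$(seq_table "$1")" = "$(seq_table "$2")" ]
}
```

For example, `same_layout foo.fasta bar.fasta || echo "files do not match" >&2` would flag a mismatch before any distances are computed.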