Distances between two files

tseemann / snp-dists

Pairwise SNP distance matrix from a FASTA sequence alignment

GNU General Public License v3.0


Distances between two files #42

Closed mbhall88 closed 2 years ago

mbhall88 commented 3 years ago

How easy/hard would it be to allow pairwise distance between two files?

For example, say I have two ways of producing consensus sequences and I want to compare the distance between these two methods. So, in this example, the distance matrix might not be symmetrical and the diagonal might not be 0.

The obvious requirement would be that both files contain the same number of sequences, all of the same length and with matching header IDs.

kloetzl commented 3 years ago

It sounds to me like you don't want a distance matrix but just a distance list. You don't want to compare all n sequences in one set with all n in the other but just the "same" from the other set. I think this can be achieved with basic scripting. Somewhat along the following lines:

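# Split each alignment into one FASTA file per sequence, named after the sequence ID.
# RS='>' strips the leading '>', so it is put back when writing each record.
mkdir -p foo bar
awk -v RS='>' 'NR>1{printf ">%s", $0 > ("foo/" $1 ".fa")}' foo.fasta
awk -v RS='>' 'NR>1{printf ">%s", $0 > ("bar/" $1 ".fa")}' bar.fasta

# Compare each sequence with its namesake in the other set
for f in foo/*.fa
do
    id=$(basename "$f" .fa)
    cat "foo/$id.fa" "bar/$id.fa" > tmp.fa
    # Assuming the default tab-separated matrix output, the last row of the
    # 2x2 matrix is "<id> <distance> 0", so the distance is field 2.
    printf '%s\t' "$id"
    snp-dists tmp.fa | tail -n 1 | cut -f 2
done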

mbhall88 commented 3 years ago

No, I actually want the pairwise distances in a matrix - i.e. I do want to compare all n sequences in one file with all n sequences in the other file.

As you mention, I am doing this currently with some scripting, but thought I would raise this issue as a feature request as it would be nice to have a fast, simple process for this and it seems to fit the scope of snp-dists - which is fast and simple :smiley:
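To make the request concrete, here is a minimal Python sketch of an all-against-all distance matrix between two alignments. It is not how snp-dists or any other tool does it; the function names (`read_fasta`, `snp_distance`, `cross_matrix`) are made up for illustration, and the choice to count only positions where both bases are A, C, G or T is an assumption of this sketch.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (id, sequence) pairs, keeping input order."""
    records = []
    for chunk in text.split(">")[1:]:
        lines = chunk.strip().splitlines()
        if not lines:
            continue
        seq_id = lines[0].split()[0]          # ID is the first word of the header
        sequence = "".join(line.strip() for line in lines[1:])
        records.append((seq_id, sequence))
    return records


def snp_distance(a, b):
    """Count aligned positions where both bases are A/C/G/T and they differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same aligned length")
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x != y and x in "ACGT" and y in "ACGT"
    )


def cross_matrix(fasta_a, fasta_b):
    """Return (row_ids, col_ids, matrix): rows come from A, columns from B."""
    rows = read_fasta(fasta_a)
    cols = read_fasta(fasta_b)
    matrix = [[snp_distance(sa, sb) for _, sb in cols] for _, sa in rows]
    return [r[0] for r in rows], [c[0] for c in cols], matrix


if __name__ == "__main__":
    method_a = ">s1\nACGT\n>s2\nACGA\n"
    method_b = ">s1\nACCT\n>s2\nACGA\n"
    row_ids, col_ids, matrix = cross_matrix(method_a, method_b)
    print("\t".join([""] + col_ids))
    for seq_id, row in zip(row_ids, matrix):
        print("\t".join([seq_id] + [str(d) for d in row]))
```

For the toy input the matrix is `[[1, 1], [2, 0]]`: it is not symmetric and the diagonal is not all zeros, which is the situation described in the original post.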

kloetzl commented 3 years ago

If all you want is a snp-dists <(cat $*) that should be easy enough to implement.

mbhall88 commented 3 years ago

I guess the issues with concatenating the two files are:

You would get duplicate sequence identifiers
You would also get the distances of each file against itself
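Both points can be worked around with a little post-processing, sketched below in Python. The idea is to tag the IDs of each file before concatenating, then keep only the rows from one file and the columns from the other. The helper names (`tag_ids`, `cross_block`) and the `A|` / `B|` tags are hypothetical, and the sketch assumes the matrix is tab-separated with a header row whose first cell is a label rather than a sequence ID.

```python
import csv
import io


def tag_ids(fasta_text, tag):
    """Prefix every FASTA header with `tag` so IDs stay unique after concatenation."""
    return "".join(
        ">" + tag + line[1:] if line.startswith(">") else line
        for line in fasta_text.splitlines(keepends=True)
    )


def cross_block(matrix_tsv, row_tag, col_tag):
    """Keep only rows whose ID starts with `row_tag` and columns starting with `col_tag`."""
    table = list(csv.reader(io.StringIO(matrix_tsv), delimiter="\t"))
    header, body = table[0], table[1:]
    # Column 0 holds the row IDs, so it is never a candidate data column.
    keep = [i for i, name in enumerate(header) if i > 0 and name.startswith(col_tag)]
    out = [[""] + [header[i] for i in keep]]
    for row in body:
        if row and row[0].startswith(row_tag):
            out.append([row[0]] + [row[i] for i in keep])
    return out
```

In use, one would write `tag_ids(a, "A|") + tag_ids(b, "B|")` to a file, run the distance tool on it, and pass its output to `cross_block(output, "A|", "B|")`. The within-file distances are still computed and then thrown away, so this roughly doubles the work compared with a tool that compares the two files directly.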

mbhall88 commented 2 years ago

I implemented this in https://github.com/mbhall88/psdm