tseemann / snp-dists

Pairwise SNP distance matrix from a FASTA sequence alignment
GNU General Public License v3.0

Duplicate sequences show a SNP difference of 1 #21

Closed by ramadatta 6 years ago

ramadatta commented 6 years ago

Hi Seemann,

I ran Gubbins on two identical sequences and fed the resulting "polymorphic_sites.fasta" to snp-dists. Since the two sequences are exactly the same, I expect no SNP differences between them, but I keep getting a difference of 1 SNP. Is this a bug, or am I missing something?

Thanks.

My output looks like this:

$ snp-dists -b -c test.fasta

This is snp-dists 0.6
Read 2 sequences of length 100566
,Reference_CP028169.fasta.ref,CP028169_Duplicate.fasta
Reference_CP028169.fasta.ref,0,1
CP028169_Duplicate.fasta,1,0

tseemann commented 6 years ago

Did the file originate from a Mac or Windows computer? It could have the wrong newline endings. What OS are you on? If you are on Mac or Linux, run dos2unix polymorphic_sites.fasta and mac2unix polymorphic_sites.fasta on it and see if that fixes it.

tseemann commented 6 years ago

I tried the other direction (unix2dos) and it didn't cause any problems:

$ cat foo.fa && snp-dists -b -c foo.fa
>S1
ATGC
ATGC
>S2
ATGC
ATGC

This is snp-dists 0.6
Read 2 sequences of length 8
,S1,S2
S1,0,0
S2,0,0

Email me the file if you like and I will try it out.

You can use od -a polymorphic_sites.fasta to inspect it at a character level.
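As a rough alternative to reading od output by eye, the line endings can also be tallied programmatically. The following is a minimal sketch, not part of snp-dists; the function name count_line_endings and the sample bytes are made up for illustration.

```python
def count_line_endings(data: bytes) -> dict:
    """Tally Windows (CRLF), old Mac (lone CR) and Unix (lone LF) line endings."""
    crlf = data.count(b"\r\n")
    cr = data.count(b"\r") - crlf  # CRs that are not part of a CRLF pair
    lf = data.count(b"\n") - crlf  # LFs that are not part of a CRLF pair
    return {"crlf": crlf, "cr": cr, "lf": lf}


# Hypothetical sample: first record has Windows endings, second has Unix endings.
sample = b">S1\r\nATGC\r\n>S2\nATGC\n"
print(count_line_endings(sample))  # {'crlf': 2, 'cr': 0, 'lf': 2}
```

To check a real file, read it in binary mode, for example open("polymorphic_sites.fasta", "rb").read(), and pass the bytes to the function. A file written on Linux would be expected to report only "lf" endings.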

ramadatta commented 6 years ago

Hi Seemann,

Thank you. I generated the file on Linux only, but I could not trace the problem. The newlines seem to be correctly placed and may not be the cause of this error.

Please find the file for your reference. Thanks so much!

test.fasta.zip

tseemann commented 6 years ago

Your sequences are not the same (but they are the same length). I put each sequence into its own file, called 1 and 2. Here is the first difference:

$ cmp 1 2
1 2 differ: byte 8100, line 1
$ cut -c 8099-8101 1
GNN
$ cut -c 8099-8101 2
GGC

There are many more. They don't have the same distribution of letters:

Reference_CP028169.fasta.ref    dna     100566  |N    118  0.1% |A  22048 21.9% |T  22381 22.3% |G  27497 27.3% |C  28361 28.2%
CP028169_Duplicate.fasta        dna     100566  |N     72  0.1% |A  22063 21.9% |T  22383 22.3% |G  27518 27.4% |C  28369 28.2%
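The same two checks (position of the first mismatch, and per-letter counts) can be sketched in a few lines of Python. This is only an illustration on toy strings, not the tool used to produce the table above; first_difference and composition are hypothetical helper names.

```python
from collections import Counter


def first_difference(a: str, b: str):
    """Return the 1-based position of the first mismatch, or None if the
    sequences agree over their shared length (similar in spirit to cmp)."""
    for pos, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return pos
    return None


def composition(seq: str) -> Counter:
    """Count how often each character occurs, ignoring case."""
    return Counter(seq.upper())


# Toy sequences: same length, but one has N where the other has real bases.
ref = "ACGNNT"
dup = "ACGGCT"
print(first_difference(ref, dup))                    # 4
print(composition(ref)["N"], composition(dup)["N"])  # 2 0
```

If the two compositions differ, the sequences cannot be identical, which is a quick sanity check before looking at individual positions.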

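You only get a distance of 1 because -a wasn't used. By default snp-dists counts a difference only where both sequences have A, C, G or T, so mismatches involving N are ignored. If -a is enabled, all differences are counted and there are 47.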
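To make the two counting modes concrete, here is a minimal Python sketch of the idea. It is not the snp-dists source code; the function snp_distance, its count_all parameter (standing in for -a) and the case-insensitive comparison are assumptions made for illustration.

```python
def snp_distance(a: str, b: str, count_all: bool = False) -> int:
    """Count differing columns between two aligned sequences.

    With count_all=False, a column counts only when both characters are
    A, C, G or T. With count_all=True, every mismatch counts.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    acgt = set("ACGT")
    distance = 0
    # Assumption: compare case-insensitively.
    for x, y in zip(a.upper(), b.upper()):
        if x == y:
            continue
        if count_all or (x in acgt and y in acgt):
            distance += 1
    return distance


# Toy alignment: two columns differ only by N, one column is a real A/C SNP.
ref = "ACGNNTA"
dup = "ACGGCTC"
print(snp_distance(ref, dup))                  # 1
print(snp_distance(ref, dup, count_all=True))  # 3
```

On this toy pair the default mode reports 1 and the count-all mode reports 3, the same pattern as the 1 versus 47 seen with the real file.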