veg / tn93

TN93 fast distance calculator
MIT License
15 stars 7 forks source link

Question about results when using SKIP as the match mode #27

Closed reagank closed 2 years ago

reagank commented 2 years ago

Hello,

My lab currently uses this TN93 implementation both by itself and as part of HIVTrace. We have a need for a pure python implementation, which I've been working on. I'm comparing the output from my implementation to the output from this implementation to ensure I'm producing consistent results. As I've been looking at the results produced when using the SKIP match mode I'm seeing a difference in the values produced when dealing with a sequence that has ambiguities.

I looked deeper into the final matrix of counts, and I'm finding that the ambiguities seem to be being counted rather than skipped. I want to make sure I have the behavior correct - should the SKIP mode pass over ambiguous nucleotides the same way it passes over gaps?

I have a small example file to demonstrate what I mean (I had to add a .txt suffix, but it's a standard fasta file). The file has two 240nt sequences in it - the first has no ambiguous nucleotides, and the second has seven ambiguous nucleotides.

When I run tn93 on this file (setting the ambiguity mode to SKIP and setting the reporting threshold value to 1) I get a distance of 0.0190 and these counts:

       A     C     G    T
A   89.5     0      1     0.5 
C      0     37     0.5   0.5 
G      1     0.5    55.5   0 
T      0     0.5    0     53.5

The counts for this sum to 240 - the full length of the sequence.

When I run the python version that I've developed I'm getting a much lower distance (0.0043) and I'm getting these counts:

     A     C     G    T
A    89     0     1     0 
C    0     36     0     0 
G    1      0     54    0 
T    0      0     0    53

This sums to 233, which is the number I expect - the full sequence minus the seven positions where there were ambiguous nucleotides. Should there be fractional counts when SKIP is selected? Since it doesn't try any resolutions I thought the counts would all be integers.

Thanks very much!

short_example_seqs.fasta.txt

spond commented 2 years ago

Dear @reagank,

This may have been an issue related to https://github.com/veg/tn93/issues/24 although I am not sure. The current tn93 version, with some debugging text added yields the following results...

./tn93 -q -a skip ~/Downloads/short_example_seqs.fasta.txt 

(0,0) = 89
(0,1) = 0
(0,2) = 1
(0,3) = 0
(1,0) = 0
(1,1) = 36
(1,2) = 0
(1,3) = 0
(2,0) = 0
(2,1) = 0
(2,2) = 54
(2,3) = 0
(3,0) = 0
(3,1) = 0
(3,2) = 0
(3,3) = 53

which seems correct since a direct examination of character by character differences returns, so only one difference (A=>G) should be counted.

C=>S
C=>K
G=>M
A=>G
A=>W
G=>R
G=>S
T=>Y

Would you try with v1.0.9 and see if the issue is still there?

Best, Sergei

reagank commented 2 years ago

Hi Sergei,

Thanks for getting back to me so quickly. I've had a chance to look at it, and I can get it to produce the expected value and counts.

What I was doing wrong was passing the ambiguity mode in ALL CAPS. When I tried again using the lower case then it works correctly. The results I'm seeing must be the software not recognizing the mode and using "subset" and treating the all caps word as the nucleotides to resolve.

I really appreciate the assistance, I'll go ahead and close this.