iTaxoTools / TaxI2-legacy

Calculates genetic differences between DNA sequences
GNU General Public License v3.0

Excessively large distances with "already aligned" sequences #46

Closed mvences closed 3 years ago

mvences commented 3 years ago

I am currently analyzing a big dataset for which I would like to compute the kind of distance data that TaxI2 produces. The dataset is already aligned, and I noticed that the program outputs very large distances (>0.5), which are unrealistic. I have tested this with a smaller dataset (attached here): I get the correct distances, with maximum values around 0.1, when the program does the alignment, but very large values when I click the "already aligned" option.

The sequences (because they are aligned) contain a number of gaps at the beginning and end. Maybe the program is removing these before starting the distance calculations? Then it would basically compare unaligned sequences. The idea would be to ignore gaps at the beginning and end (which the distance algorithms probably do anyway) but not remove them. Maybe this would solve the problem.

Here below is the example file and the two screenshots.

With aligning sequences in TaxI2, distances are correct:

Capture_withalignment

"Already aligned" option, distances are wrong (although sequences are aligned in input file): Capture_noalign

Example file:

brygootest100seqs.txt
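
A minimal sketch of the effect suspected above, assuming (hypothetically) that terminal gaps are stripped before the sequences are compared column by column. This is not TaxI2's code; the function `p_distance` and both sequences are made up for illustration.

```python
def p_distance(a, b):
    """Proportion of differing sites, skipping columns where either sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

# Two hypothetical aligned sequences; the second lacks data at the start.
s1 = "acgtacgtacgt"
s2 = "---tacgtacgt"

# Gaps kept: columns stay in register, only shared sites are compared.
aligned = p_distance(s1, s2)

# Gaps removed first: the second sequence shifts left by three columns.
stripped = p_distance(s1.replace("-", ""), s2.replace("-", ""))

print(aligned, stripped)
```

With the gaps kept in place the two sequences are identical over their shared columns; once the gaps are removed, every compared column is out of register, which is one way unrealistically large distances could arise.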

necrosovereign commented 3 years ago

All the sequences in the file start with 'n-----'. I don't understand what 'n' signifies in this case. Is it supposed to indicate that the gap at the beginning is not a terminal gap?

mvences commented 3 years ago

Aaaah, you are right, I forgot to remove those "n" at the beginning. They were added to avoid artefacts when managing the sequences in Excel.

HOWEVER, it would in fact be good if the program treated gaps (-), missing data (n), and question marks (?) as equivalent in the terminal stretches of sequences. So any stretch like ----- ????? nnnn NNNNN N-????? n------- coming BEFORE or AFTER any other character (ACGTRYSWKM) would be treated equally and excluded from all distance calculations.
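
One possible way to implement the request above, as a sketch only: replace every leading and trailing run of `-`, `?`, `n`, or `N` with a single missing-data symbol while keeping the sequence length, so the alignment columns are not shifted. The function name `mask_terminal_missing` and the choice of `?` as the fill symbol are assumptions, not part of TaxI2.

```python
MISSING_CHARS = "-?nN"  # characters treated as equivalent in terminal stretches

def mask_terminal_missing(seq, fill="?"):
    """Replace terminal runs of gap/missing characters with `fill`, keeping the length."""
    lead = len(seq) - len(seq.lstrip(MISSING_CHARS))
    if lead == len(seq):
        # The sequence holds no nucleotide data at all.
        return fill * len(seq)
    trail = len(seq) - len(seq.rstrip(MISSING_CHARS))
    return fill * lead + seq[lead:len(seq) - trail] + fill * trail

# Internal gaps are left untouched; only the terminal stretches change.
print(mask_terminal_missing("n-----acg-tacn??"))
print(mask_terminal_missing("N-?????acgt"))
```

Keeping the length unchanged matters here: the masked positions can then be skipped during the distance calculation without the remaining columns moving out of register.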

necrosovereign commented 3 years ago

What about situations like this:

atgncca
atgtcca

Should it be treated as insertion/deletion or replacement? And if it's a replacement, then, for the case of K2P distance, is it transition or transversion? And what if there is '?' instead of 'n'?

mvences commented 3 years ago

In this case, both n and ? should be treated as missing information. That is, at this position there could be anything (ACGT-) and for this sequence, the position should not be taken into account.

necrosovereign commented 3 years ago

But should it be counted for the p-distance with gaps?

mvences commented 3 years ago

No, also for p-distance with gaps, an n or question mark should not be counted.

mvences commented 3 years ago

I have just run the program after removing the "n" at the beginning, but it still calculates the distances incorrectly, so this was not the origin of the problem.

necrosovereign commented 3 years ago

So for a situation like this:

gg-ccnccta
ggaccaccaa

p-distance should be 1/8 and p-distance with gaps should be 2/9?

mvences commented 3 years ago

Yes, correct!
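
The rules agreed in this thread can be checked with a short sketch. It assumes that columns containing `n`, `N`, or `?` in either sequence are skipped for both distances, that a gap opposite a base counts as a difference only for the "with gaps" variant, and that columns where both sequences have a gap are ignored (this last point is not discussed in the thread). The function name `p_distances` is hypothetical, and ambiguity codes such as R or Y are compared as plain characters.

```python
from fractions import Fraction

MISSING = set("nN?")  # missing data: the column is never counted

def p_distances(a, b):
    """Return (p-distance, p-distance with gaps) for two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sites = 0            # gap columns excluded
    diffs_gaps = sites_gaps = 0  # gap columns included
    for x, y in zip(a.lower(), b.lower()):
        if x in MISSING or y in MISSING:
            continue  # missing information on either side
        if x == "-" and y == "-":
            continue  # assumption: a shared gap carries no information
        sites_gaps += 1
        diffs_gaps += x != y
        if x != "-" and y != "-":
            sites += 1
            diffs += x != y
    # Assumes at least one comparable column for each variant.
    return Fraction(diffs, sites), Fraction(diffs_gaps, sites_gaps)

p, p_gaps = p_distances("gg-ccnccta", "ggaccaccaa")
print(p, p_gaps)  # 1/8 2/9
```

For the pair above, the `n` column is dropped, leaving 9 columns with 2 differences when the gap is counted and 8 columns with 1 difference when it is not, which matches the values confirmed in the thread.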