Add options to (i) impute missing distances and (ii) to autocorrect distances from incomplete sequences

Some genes often used for DNA barcoding (ITS, 18S, 16S) consist of stretches that are highly conserved and others that are more variable (corresponding to stems and loops of the RNA secondary structure). In protein-coding genes such as COI or cytb this is less pronounced, but may still apply in some cases. For such genes or gene fragments, if incomplete sequences are included in the alignment and distance calculation (e.g., with missing data at the beginning or end of the sequence), the pairwise distances will be biased, because they are calculated only from the available part of the sequence, which may have a higher or lower proportion of variable sites than the overall alignment. Furthermore, in some cases calculating a distance is impossible: if some sequences lack a large number of nucleotides at the beginning and others at the end, two sequences may have no overlapping part at all, and no pairwise distance can be calculated between them.

It would be good to account for these two cases. We need to inform users that it is, of course, better to use only sequences with no or few missing data, but sometimes it will be desirable to include incomplete sequences, e.g. from unique individuals, in the calculation. These two options would be included in the backend and would add to the available choices of distance calculations in the "all against all" mode, so that the imputed/corrected distances can be used e.g. in ASAP.

Case 1. Correcting distances arising from incomplete sequences

I have not found any existing algorithm for this. It certainly is solvable with machine learning but this might be overkill. We can solve this with a simple deterministic algorithm which only requires that a minimum of 2 (I think we should require 3-5) sequences in the alignment are complete over the total alignment. Then, the following procedure can be applied to correct the distances:

Assume that gaps at the beginning and end of a sequence are true missing data (incomplete sequence). Assume that the characters n, N and ? are always missing data (at the beginning and end of the sequences).
Take those sequences that are complete over the full alignment (F). The method will only work if there are at least three such complete sequences.
For each pairwise comparison involving an incomplete sequence, identify the alignment positions that are being compared and used for the pairwise distance calculation (C) and those that are missing and not being compared (NC).
For the complete sequences, calculate the average pairwise distance over the full alignment (F) and over the C stretch only, and calculate the ratio F/C. If the NC stretch is hypervariable, the C stretch will have a lower distance than the entire sequence and F/C will be >1. If the NC stretch is highly conserved, F/C will be <1.
Correct the respective pairwise distance by multiplying it by the F/C ratio.

If NC is hypervariable, the original pairwise comparison is based mainly on very constant parts of the alignment and the p-distance is too low. Multiplying by F/C increases it.
If NC is conserved, the original pairwise comparison is based mainly on hypervariable parts of the alignment and the p-distance is too high. Multiplying by F/C decreases it.

Example, taking the complete pair in the first line as the reference for F:

AATAATTTAT AAAAATGGAA p = 4/10 = 0.4

AATAATTTAT -----TGGAA p = 3/5 = 0.6 F/C = 0.4/0.6 = 0.67 corrected p = 0.6 × 0.67 = 0.4

AATAA----- AAAAATGGAA p = 1/5 = 0.2 F/C = 0.4/0.2 = 2 corrected p = 0.2 × 2 = 0.4
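The procedure above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not TaxI2 code: all function names are hypothetical, sequences are assumed to be aligned strings of equal length, and internal gaps are simply skipped like missing data. The correction factor is the mean distance of the complete sequences over the full alignment divided by their mean distance over the columns actually compared, which is the ratio used in the worked example (0.4/0.6 and 0.4/0.2).

```python
from itertools import combinations

# Characters treated as missing data (assumption: internal gaps are skipped too).
MISSING = set("-nN?")


def observed_span(seq):
    """Return (start, end) of seq after stripping terminal missing data."""
    start, end = 0, len(seq)
    while start < end and seq[start] in MISSING:
        start += 1
    while end > start and seq[end - 1] in MISSING:
        end -= 1
    return start, end


def p_distance(a, b, start, end):
    """Uncorrected p-distance over columns start..end-1, or None if no column is comparable."""
    compared = differences = 0
    for x, y in zip(a[start:end], b[start:end]):
        if x in MISSING or y in MISSING:
            continue
        compared += 1
        if x.upper() != y.upper():
            differences += 1
    return differences / compared if compared else None


def mean_distance(seqs, start, end):
    """Average pairwise p-distance among seqs over columns start..end-1."""
    values = [p_distance(a, b, start, end) for a, b in combinations(seqs, 2)]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None


def corrected_distance(a, b, complete, min_complete=3):
    """Scale the p-distance of a and b by the ratio derived from the complete sequences."""
    if len(complete) < min_complete:
        raise ValueError("too few complete sequences for a correction")
    (a_start, a_end), (b_start, b_end) = observed_span(a), observed_span(b)
    start, end = max(a_start, b_start), min(a_end, b_end)
    if start >= end:
        return None  # no overlap at all: a missing distance (Case 2)
    raw = p_distance(a, b, start, end)
    d_full = mean_distance(complete, 0, len(a))
    d_compared = mean_distance(complete, start, end)
    if raw is None or not d_full or not d_compared:
        return raw  # ratio undefined (e.g. zero reference distance): leave uncorrected
    return raw * d_full / d_compared


# The worked example, with only two reference sequences for brevity.
reference = ["AATAATTTAT", "AAAAATGGAA"]
print(round(corrected_distance("AATAATTTAT", "-----TGGAA", reference, min_complete=2), 2))
```

With the two sequences of the worked example as the reference set, both truncated comparisons are scaled back to 0.4. How to handle a reference distance of zero in the compared stretch is an open design choice; the sketch leaves such distances uncorrected.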

Case 2. Missing distances

Most programs will fail or give an error message if two non-overlapping sequences are found, because no distance can be calculated between them. I am currently not sure how TaxI2 handles such cases. It would be desirable to have three options:

The program issues an error message and excludes the non-overlapping sequences from the calculation, so that both are absent from the resulting matrix (distances to other sequences are also not calculated or shown).
The program issues an error message and, for the non-overlapping sequence comparisons, writes "NA" (not available) to the distance matrix.
The program imputes the missing distances.
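The first two options could look roughly like the sketch below. It is a minimal illustration only: the function names and the `on_missing` parameter are hypothetical and not part of TaxI2, the matrix is a plain dict, and messages are collected in a list instead of being shown to the user. Imputation (the third option) would be a separate step operating on the "na" output.

```python
import math

MISSING = set("-nN?")


def p_distance(a, b):
    """p-distance over columns where both sequences have data; NaN if there are none."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x not in MISSING and y not in MISSING]
    if not pairs:
        return math.nan
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(seqs, on_missing="na"):
    """Build an all-against-all matrix as a dict keyed by (name, name).

    on_missing="na":      keep every sequence, store NaN for non-overlapping pairs
    on_missing="exclude": drop every sequence involved in a non-overlapping pair
    Returns (matrix, messages).
    """
    if on_missing not in ("na", "exclude"):
        raise ValueError(f"unknown on_missing mode: {on_missing!r}")
    names = list(seqs)
    matrix, messages, bad = {}, [], set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = p_distance(seqs[a], seqs[b])
            if math.isnan(d):
                messages.append(f"no overlap between {a} and {b}")
                bad.update((a, b))
            matrix[a, b] = matrix[b, a] = d
    if on_missing == "exclude":
        # Remove all distances that involve a non-overlapping sequence.
        matrix = {k: v for k, v in matrix.items() if not bad & set(k)}
    return matrix, messages
```

Returning NaN rather than raising keeps the calculation running for the remaining pairs, so the user gets one message per non-overlapping pair instead of a failure at the first one.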

For imputing the missing distances, there are two machine-learning approaches that can probably be implemented quite easily, as they are written in Python: https://github.com/Ananya-Bhattacharjee/ImputeDistances These require as input a matrix with missing data coded as periods, so the TaxI2 distance calculation should probably first produce such a matrix file, with the imputation taking place in a second step. To be discussed.
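As a starting point for that discussion, the first step (writing a matrix with periods for missing distances) might look like the sketch below. The exact file layout expected by ImputeDistances has not been checked here; a PHYLIP-style square matrix with the number of sequences on the first line is assumed purely for illustration, and `format_matrix` is a hypothetical helper name.

```python
import math


def format_matrix(names, rows, missing="."):
    """Render a square distance matrix as text, writing `missing` for NaN or None."""
    lines = [str(len(names))]  # assumed header: number of sequences
    for name, row in zip(names, rows):
        cells = [missing if d is None or math.isnan(d) else f"{d:.4f}" for d in row]
        lines.append(" ".join([name, *cells]))
    return "\n".join(lines) + "\n"
```

The returned text can then be written to a file and passed to the imputation step, whose output would replace the periods with estimated distances.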
