iTaxoTools / TaxI2

Calculation and analysis of pairwise sequence distances
GNU General Public License v3.0

Add option of multiple sequence alignment before performing all-against-all comparisons #7

Open mvences opened 3 years ago

mvences commented 3 years ago

For all-against-all comparisons, the current TaxI2 implementation has to perform a large number of pairwise sequence alignments, one for every pair of sequences. This is a very accurate way of calculating distances, but the number of alignments grows quickly with the size of the data set. Alternatively, the user can input a file with already aligned sequences.

To improve performance when dealing with large, unaligned sets of sequences, it would be fantastic to integrate a multiple sequence alignment option into the program. That is, the program takes the unaligned sequences, runs the multiple alignment with the respective alignment program, and the resulting data set then automatically enters the all-against-all comparison using the "already aligned" option.
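The workflow proposed above can be sketched roughly as follows. This is only an illustration, not the TaxI2 code: the names `p_distance` and `all_against_all` are hypothetical, the aligner is passed in as a plain callable so that any program could be plugged in, and the uncorrected p-distance (skipping columns with a gap in either sequence) is assumed as the distance metric.

```python
from itertools import combinations
from typing import Callable, Dict, Tuple

Sequences = Dict[str, str]  # sequence name -> sequence string


def p_distance(a: str, b: str, gap: str = "-") -> float:
    """Uncorrected p-distance over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap or y == gap:
            continue  # assumption: gapped columns are ignored
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else float("nan")


def all_against_all(
    unaligned: Sequences,
    align: Callable[[Sequences], Sequences],
) -> Dict[Tuple[str, str], float]:
    """Align once, then compare every pair of aligned sequences."""
    # One multiple alignment replaces the N*(N-1)/2 pairwise alignments.
    aligned = align(unaligned)
    return {
        (m, n): p_distance(aligned[m], aligned[n])
        for m, n in combinations(aligned, 2)
    }
```

With this shape, the "already aligned" path is just the special case where `align` returns its input unchanged, e.g. `all_against_all(seqs, align=lambda s: s)`.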

There are many multiple alignment programs out there, but in my experience, the best one for the typical sets of sequences to be entered in TaxI3, and one that can align large sets of thousands of sequences using some of its fast options, is MAFFT. The source code is (of course) in C and available here: https://mafft.cbrc.jp/alignment/software/source.html

MAFFT consists of several separate files of C code, but many of these perform special calculations that we may not need, so maybe not all of the MAFFT code needs to be integrated.

The way I interpret the MAFFT license notice, we would be allowed to use it as long as we retain the copyright notice, but if we do so, I would in any case contact the MAFFT author. Before doing that, we need to see whether this is possible at all, and whether we want to include the entire MAFFT package in TaxI3 or can extract only the part of it that implements a specific alignment algorithm.

Of course, if the relevant part of the MAFFT algorithm can be re-coded to fit seamlessly into the TaxI3 logic, maybe even making use of Rust (???), that could also be an option. I just don't know which would be the best strategy.

@StefanPatman is also looking into the functionality of MAFFT, to evaluate the possibility of writing a Python wrapper for it. So, before starting to work on this, it will be useful to coordinate with him.
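As a starting point for that discussion, a Python wrapper could be as thin as a subprocess call to an installed `mafft` executable. The sketch below is one possible shape, not an agreed design: the function names are hypothetical, it assumes `mafft` is on the `PATH`, and it relies only on the documented `--auto` and `--quiet` options and on MAFFT writing the aligned FASTA to standard output.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Dict, List


def mafft_command(fasta: Path, executable: str = "mafft") -> List[str]:
    """Build the command line; --auto lets MAFFT choose a strategy by input size."""
    return [executable, "--auto", "--quiet", str(fasta)]


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}, joining wrapped lines."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def run_mafft(fasta: Path, executable: str = "mafft") -> Dict[str, str]:
    """Align the sequences in `fasta` and return them as {header: aligned sequence}."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable!r} not found on PATH")
    result = subprocess.run(
        mafft_command(fasta, executable),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if MAFFT exits with an error
    )
    return parse_fasta(result.stdout)
```

Calling an external binary this way avoids integrating or re-coding any of the C sources, at the cost of requiring users to install MAFFT separately (or of bundling a binary, which brings the license question back).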