Implement high-throughput sequence comparison via BLAST or BLAT or MMseq2

We have seen that the pairwise alignment step in TaxI2 is very memory- and time-intensive. Certainly this can be alleviated using the Rust option, which is good because the pairwise alignment mode is an important feature of the TaxI concept - after all, such pairwise alignments give the most accurate results. So, this should certainly be kept also in TaxI3.

However, in order to make the program scalable to really (!) large datasets of maybe millions of sequences, we need an alternative, both for the "all-against-all" and for the "compare against database" approach.

The answer to this lies in the use of other algorithms that take heuristic shortcuts rather than doing accurate pairwise alignments. One of the most used algorithm is BLAST. A good explanation of how it works can be found on its Wikipedia page: https://en.wikipedia.org/wiki/BLAST_(biotechnology)

Obviously, BLAST and other similar tools are mostly written in C or C++ and therefore we must find a way of including them in TaxI3 - either by wrapping the C code, or by using a Python variant if one exists, or by translating the algorithm to Python or Rust. Which path to take needs to be discussed.

I can see two applications for this. For this I will now write about BLAST, but probably other algorithms are more suitable as I explain further below.

In the "compare against database" mode we can use BLAST in two ways: We can just 100% rely on BLAST, and just output the sequences that in the BLAST search came out most similar to the query, along with the sequence identity value provided by the inaccurate BLAST algorithm. Or we can use BLAST to find a selection of, say, the 10 most similar sequences to a query, then for these perform pairwise alignments and keep the most similar sequence and respective distance values from the pairwise alignments. Especially when the database is very large, this second option should be almost as accurate as the current implementation but much, much faster.
In the "compare all against all" mode, we could use BLAST instead of pairwise alignments, and provide all of the output for the BLAST identity values (so, there would be no JC, K2P etc. distances). This would allow including really huge data sets into the all against all comparisons. Of course the distance/identity values would be less accurate, but the advantage would be the high-throughout approach that can be scaled to hundreds of thousands of sequences.

Now, which algorithm should be used? The original BLAST certainly would be an option, but I suggest to look instead at two alternatives. Both of these are substantially faster than BLAST (even if less accurate - but for accurateness we already have the "classic" mode with pairwise alignments). Maybe even both of these options can be included, but first we should just see what is possible.

The program MMseqs2 is open source, is said to be extremely fast, and has several very interesting options. Indeed, from what I understand, it seems to include already "natively" an all-against-all comparison option, linked to a clustering function. This of course is fantastic - if then we could output the cluster information as in TaxI2, with "spart" file and so on, it would be extremely good. https://github.com/soedinglab/MMseqs2
The program "BLAT" is substantially faster than BLAST (although probably slower than MMseqs2 and without clustering function), and it seems that there is a pure Python implementation. https://github.com/Addithor/BLAT-BLAST-Like-Alignment-Tool Interestingly this repository has no LICENSE notice, but I am pretty sure it is meant to be open access and we could contact the developers (maybe create an Issue) to ask. But before, it will be good to check if this REALLY is pure Python without dependency of having somewhere the C code installed/compiled, and if it would be possible to integrate it in TaxI3.

iTaxoTools / TaxI2

Implement high-throughput sequence comparison via BLAST or BLAT or MMseq2 #5