Implement "alignment free" distance calculation

mvences commented 3 years ago

One goal of the new version of the program is to be scaled to data sets of potentially millions of sequences. With such numbers, the traditional pairwise alignments are not really feasible anymore, but there are so-called "alignment free" methods to compare sequences.

The principle is in a way similar to the BLAST and BLAT searches, but here, no database needs to be built. And there seem to be some rather user-friendly scripts that could be incorporated in TaxI3 (obviously acknowledging the sources of the code). [therefore I have closed the issue with the BLAST/BLAT searches ... if this alignment free searches can be successfully added, we probably won't need it]

The best option for us, also because it is nicely documented, would be AlfPy: https://github.com/aziele/alfpy

If you look at their website under "Overview" here: http://150.254.123.165/alfree/download/

you will see that they have a series of separate scripts, each of which can calculate different alignment-free distance metrics. The output apparently is produced as a matrix which probably can quite easily be parsed into the formats of TaxI3.

At this time I am not sure which of these many distance measures would work most efficiently with the kinds of data sets we have, but I will do some testing on this. Maybe for now, you can check if this can be implemented in TaxI3, focusing on two of the scripts ... I would suggest the first one, Base-base correlation (calc_bbc.py) and the standard word-based distance with the "Google" distance (calc_word.py).

The idea would be to test this as a third option for calculating the all-against-all distances, so we have (1) the default (pairwise alignment), (2) the already aligned option, and (3) the "use alignment free" distances option (with different options for the distance setting).

In the GUI, the respective checkbox can be put just somewhere ... it is clear that the GUI is getting very cluttered, but we will later anyway overhaul it completely, so we don't need to make it very nice or easy to understand at this time.

necrosovereign commented 3 years ago

I've checked out alfpy code. I think for calculating distances between million of sequences, it would require at least 4 TB of RAM (probably significantly more)

mvences commented 3 years ago

Yes, this is probably true for an "all-against-all" comparison, here we will always run into trouble with an excessive number of comparisons, no matter how fast the algorithm is.

But what about the comparison with a reference database? Here, for example, you could have input data of a million sequences, but the reference database is maybe only 1000 or 2000 sequences. In such case, similar to what we are doing now, it should be possible to implement the code with a limited amount of RAM since the sequences will be checked one after the other against the reference sequences.

Only that it should be substantially faster than with a regular pairwise alignment, and therefore more realistic for very large sets of query sequences...

So I can still see a substantial benefit of this algorithm to scale up the number of sequences that can be compared in a given time.

necrosovereign commented 3 years ago

It turns out that alfpy only calculates pairwise distances within one set of sequences, so for now I only implemented it for "all-against-all" mode (pull request #16). But I have an idea on how to adapt it to "Compare against reference" mode

iTaxoTools / TaxI2

Implement "alignment free" distance calculation #12