Add new program mode: dereplicate

mvences commented 3 years ago

I have been thinking about new functions that could be added to TaxI3 to make the program's "skills" applicable to additional tasks that often need to be performed during the analysis of DNA sequences. This issue describes the first of these, and I will create a second issue with a further such addition.

For this new addition, it would be good to create a third "mode" in the TaxI3 GUI: for now we have "compare all against all" and "compare to reference", and we would now add "dereplicate" (and, as a fourth mode, "decontaminate", which I will describe in a further issue).

[I know the current GUI will become more and more cluttered with different options, but please don't worry about this - as discussed before, in the end Stefanos will add a totally new graphical interface to make this easy to understand and user-friendly]

The task to be performed is relatively simple: Often, our long lists of sequences will contain sequences that are identical or very similar to each other, and we want to "dereplicate" them, that is, keep only one of these repeated sequences. Because TaxI3 has now implemented the fast alignment-free distance calculation and the "Rust" improvement, it is perfect to perform such a task.

There are published algorithms and tools that can do such things, but they all have important shortcomings, so I think it will be better (and should be easy) to come up with something from scratch. The program should just run an optimized "compare all against all" comparison and save a new data file corresponding to the input file but with all replicate sequences excluded. It should also output one file with some general statistics (how many sequences were excluded, etc.).

For the "compare all against all" variant used in this mode, one thing could perhaps be changed to optimize speed and memory: once the program finds a "replicate" sequence that (according to the user options, see below) should be excluded, there is no need to include this sequence in any further comparisons. Of course, keeping track of the sequences to be excluded will itself use memory, and I am not sure what the best way to deal with this would be. With small files of maybe 500 or 1000 sequences, it should be no problem anyway, as the alignment-free comparisons are very fast. For larger files, it might be worth thinking about how best to implement this. In any case, the comparison should be performed in a way that allows analyzing data sets of hundreds of thousands of sequences without running into memory problems. Maybe one option would be to work in iterations: once 1000 or 2000 sequences to be excluded have been identified, the program writes a temporary file with all sequences except these, and starts the process from scratch.
One of the user options should be the similarity percentage used to flag sequences as replicates. The default should be 100% (only fully identical sequences are replicates, but independent of their length: if one sequence is 500 bp and one 100 bp, and they are identical over their overlapping length, then they do count as replicates).
To fine-tune the program for problems with overlap and sequence length, the user can choose to simply remove from the data set all sequences shorter than a certain length.
If replicates are found, which of them should be kept and which excluded? As a default, I would say, always keep the longest sequence (not counting gaps or missing data). But give the user the option to keep instead the sequence with the smallest amount of missing data (N or ?). That is, if there is a 500 bp sequence with a stretch of 20 N and a 300 bp sequence free of N, then keep the shorter (300 bp) sequence (which the user may decide is overall of better quality).
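
To make the idea above concrete, here is a minimal sketch of one possible greedy approach. It is not the TaxI3 implementation. The sequences are first sorted so that the preferred one (longest, or with the least missing data) comes first, and each sequence is then compared only against the sequences already kept, so an excluded sequence never takes part in any further comparison. All names (`dereplicate`, `effective_length`, `missing_count`, `toy_distance`) are hypothetical, and `toy_distance` is only a stand-in for the real alignment-free distance.

```python
from typing import Callable, List, Tuple

Record = Tuple[str, str]  # (sequence name, sequence)

MISSING = set("Nn?")
GAPS = set("-")


def effective_length(seq: str) -> int:
    """Sequence length, not counting gaps or missing data."""
    return sum(1 for c in seq if c not in MISSING and c not in GAPS)


def missing_count(seq: str) -> int:
    """Number of missing-data characters (N or ?)."""
    return sum(1 for c in seq if c in MISSING)


def toy_distance(a: str, b: str) -> float:
    """Placeholder only: 0.0 if one sequence contains the other, else 1.0.

    A real run would call the alignment-free distance instead.
    """
    short, long_ = sorted((a.upper(), b.upper()), key=len)
    return 0.0 if short in long_ else 1.0


def dereplicate(
    records: List[Record],
    distance: Callable[[str, str], float],
    threshold: float,
    min_length: int = 0,
    keep: str = "longest",
) -> Tuple[List[Record], List[Record], List[Record]]:
    """Return (kept, excluded_replicates, removed_as_too_short)."""
    too_short = [r for r in records if effective_length(r[1]) < min_length]
    candidates = [r for r in records if effective_length(r[1]) >= min_length]

    # Preferred sequences first, so the first member of each group of
    # replicates that is encountered is the one to keep.
    if keep == "fewest_missing":
        candidates.sort(key=lambda r: (missing_count(r[1]), -effective_length(r[1])))
    else:
        candidates.sort(key=lambda r: -effective_length(r[1]))

    kept: List[Record] = []
    excluded: List[Record] = []
    for rec in candidates:
        # Compare only against kept sequences; excluded ones are never revisited.
        if any(distance(rec[1], k[1]) <= threshold for k in kept):
            excluded.append(rec)
        else:
            kept.append(rec)
    return kept, excluded, too_short


records = [("a", "ACGTACGT"), ("b", "ACGT"), ("c", "TTTTGGGG"), ("d", "ACGTACGT")]
kept, excluded, too_short = dereplicate(records, toy_distance, threshold=0.0)
print([name for name, _ in kept], [name for name, _ in excluded])
```

With this ordering only the kept sequences and the list of excluded records stay in memory during the comparison loop. For very large files the excluded records could be written to disk as they are found instead of being collected in a list.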

Output:

As I said, there should be one output file with the original set of sequences minus all the replicates that have been excluded. Because these files can be very large, it is better not to show them in the GUI but to save them automatically, maybe just in the same folder as the input file, adding "_dereplicated" to the filename.
Maybe it would be a good idea to give the user the option "Save excluded replicates to separate file". In this case, a second output file is saved in the same folder, with "excluded_replicates" added to the filename.
Then, one text file should be shown in the GUI with some general information about the concluded process: For now, it should be sufficient to print the names and folder of the output files, number of sequences processed, number of sequences excluded. Later we can maybe add some general statistics, like the average length of sequences kept and average length of sequences excluded, etc.
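
As a rough illustration of the output naming and the summary text described above, a sketch could look like the following. The function names (`output_paths`, `summary_text`) are hypothetical, and inserting the suffix before the file extension, as well as the leading underscore in "_excluded_replicates", are assumptions.

```python
from pathlib import Path
from typing import Tuple


def output_paths(input_file: str) -> Tuple[Path, Path]:
    """Derive both output paths in the same folder as the input file."""
    p = Path(input_file)
    kept = p.with_name(f"{p.stem}_dereplicated{p.suffix}")
    excluded = p.with_name(f"{p.stem}_excluded_replicates{p.suffix}")
    return kept, excluded


def summary_text(input_file: str, n_processed: int, n_excluded: int,
                 save_excluded: bool = False) -> str:
    """Build the short report to be shown in the GUI."""
    kept_path, excluded_path = output_paths(input_file)
    lines = [
        f"Output folder: {kept_path.parent}",
        f"Dereplicated file: {kept_path.name}",
    ]
    if save_excluded:
        lines.append(f"Excluded replicates file: {excluded_path.name}")
    lines.append(f"Sequences processed: {n_processed}")
    lines.append(f"Sequences excluded: {n_excluded}")
    return "\n".join(lines)


print(summary_text("data/seqs.fas", n_processed=1000, n_excluded=120,
                   save_excluded=True))
```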

necrosovereign commented 3 years ago

I'm not sure how the alignment-free distance corresponds to similarity percentage

mvences commented 3 years ago

This is a good point, in fact, I also have no idea. Also, I am not 100% sure which of the alignment-free distance calculations of Alfpy you have implemented. So the best will be to test this empirically. Unfortunately since I am travelling, I cannot run the program myself.

I attach here an aligned FASTA file of 100 bp sequences in which the first sequence is a reference, the next two differ from it by 1%, the next two by 2%, the next two by 5%, the next two by 10%, and the last by 20%. Can you please run this as all-against-all using the alignment-free option and send me the results? With this, it should be possible to correlate, more or less, how the alignment-free distance corresponds to the normal distance, even if I expect the correlation will not be exact when calculated for complex sequences.
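
For the "normal distance" side of this comparison, the percent identity of each aligned sequence to the reference can be computed directly from the alignment. The sketch below is only an illustration with hypothetical function names (`parse_fasta`, `percent_identity`). It assumes all sequences are aligned to the same length and that positions with a gap or missing data in either sequence are ignored.

```python
from typing import List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (name, sequence) pairs; sequences may span lines."""
    records: List[Tuple[str, List[str]]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append((line[1:].strip(), []))
        elif records:
            records[-1][1].append(line)
    return [(name, "".join(parts)) for name, parts in records]


def percent_identity(a: str, b: str) -> float:
    """Percent of identical positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    skip = set("-Nn?")
    pairs = [(x.upper(), y.upper()) for x, y in zip(a, b)
             if x not in skip and y not in skip]
    if not pairs:
        return float("nan")
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


# Tiny made-up alignment, not the attached test file.
example = ">ref\nACGTACGTAC\n>s1\nACGTACGTAA\n>s2\nACGTTCGTAA\n"
seqs = parse_fasta(example)
reference = seqs[0][1]
for name, seq in seqs[1:]:
    print(name, percent_identity(reference, seq))
```

Setting these values next to the alignment-free distances for the same pairs gives the empirical correspondence asked for above.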

By the way, I cannot exactly remember, has the alignment-free option already been implemented also for the "compare against reference" mode? I think yes but I cannot fully remember...

sequences_distance_test.fas.txt

necrosovereign commented 3 years ago

Yes, the alignment-free option is already implemented for all modes.

Here is the alignment-free distance table that has been calculated from your file: sequence_distance_test_table.txt

mvences commented 3 years ago

Thank you for the table. So, with this data I have made the following quick and dirty correlation:

As I expected, the correlation is not exact. So we could give the user the following options which should be more or less working:

Alignment-free distance 0.00-0.07 (corresponds to 98-100% sequence identity) [let's set this as default!]
Alignment-free distance 0.00-0.10 (corresponds to 95-100% sequence identity)
Alignment-free distance 0.00-0.25 (corresponds to 90-100% sequence identity)
Alignment-free distance 0.00-0.31 (corresponds to 80-100% sequence identity)
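
These four options could be stored as a simple lookup from minimum sequence identity to alignment-free distance cutoff, for example as sketched below. The names (`IDENTITY_TO_AF_CUTOFF`, `af_cutoff`, `is_replicate`) are hypothetical, and treating the upper bound of each range as inclusive is an assumption.

```python
# Minimum sequence identity (%) -> maximum alignment-free distance,
# taken from the approximate correspondence listed above.
IDENTITY_TO_AF_CUTOFF = {
    98: 0.07,  # default
    95: 0.10,
    90: 0.25,
    80: 0.31,
}
DEFAULT_MIN_IDENTITY = 98


def af_cutoff(min_identity: int = DEFAULT_MIN_IDENTITY) -> float:
    """Return the alignment-free distance cutoff for a supported identity level."""
    try:
        return IDENTITY_TO_AF_CUTOFF[min_identity]
    except KeyError:
        supported = sorted(IDENTITY_TO_AF_CUTOFF, reverse=True)
        raise ValueError(f"unsupported identity level; choose one of {supported}")


def is_replicate(af_distance: float, min_identity: int = DEFAULT_MIN_IDENTITY) -> bool:
    """Flag a pair as replicates if its distance is within the cutoff (inclusive)."""
    return af_distance <= af_cutoff(min_identity)


print(is_replicate(0.08))      # outside the default 0.00-0.07 range
print(is_replicate(0.08, 95))  # inside the 0.00-0.10 range
```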

necrosovereign commented 3 years ago

I've attempted to implement this mode, but the new options don't fit into the window. I think the GUI needs to be reworked before new features can be added.

mvences commented 3 years ago

Yes, it is clear that the GUI is already incredibly full. However, I think it will take some weeks or even months before Stefanos can work on the new GUI since he will start with concatenator. So it will be better not to wait for this before adding the new functions.

Maybe the two following options would be possible:

Either just increase the size of the current tkinter GUI and add the new features in a new, empty part of the enlarged GUI canvas (but only if this can be done without much effort; it is not worth spending a lot of time on changing the tkinter GUI).

Or, alternatively, just add three radiobuttons for the new functions, with very short descriptions ("DEREP" for dereplicate mode, "DECONT" for decontamination, "BLAST" for external Blast) and, if necessary to make some space, abbreviate the two existing modes ("ALL" for all-against-all comparisons and "REF" for comparison against a reference database). For the three new modes, DEREP, DECONT and BLAST, just implement the function with default settings and without any further user options in the GUI. This will allow us to run some basic tests of the new functions, and will make it easier for Stefanos to integrate them into the new GUI later ... and once he designs the new GUI, the user options (like setting similarity percentages) can be added.
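
A minimal sketch of the radiobutton variant, assuming a plain tkinter parent widget, might look like this. The `MODES` list and the `build_mode_selector` function are hypothetical names, not part of the existing TaxI3 code, and where the returned frame is placed in the window is left to the caller.

```python
# Short labels for the radiobuttons, paired with the full mode names.
MODES = [
    ("ALL", "compare all against all"),
    ("REF", "compare to reference"),
    ("DEREP", "dereplicate"),
    ("DECONT", "decontaminate"),
    ("BLAST", "external Blast"),
]


def build_mode_selector(parent):
    """Create one row of radiobuttons; return the frame and the mode variable."""
    import tkinter as tk  # imported here so MODES can be used without a display

    mode_var = tk.StringVar(master=parent, value=MODES[0][0])
    frame = tk.Frame(parent)
    for column, (label, _description) in enumerate(MODES):
        button = tk.Radiobutton(frame, text=label, variable=mode_var, value=label)
        button.grid(row=0, column=column, sticky="w")
    return frame, mode_var
```

The caller would place the frame with `frame.pack()` or `frame.grid(...)`, whichever the surrounding window already uses, and read the selected mode with `mode_var.get()`.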

iTaxoTools / TaxI2

Add new program mode: dereplicate #20