iTaxoTools / TaxI2

Calculation and analysis of pairwise sequence distances
GNU General Public License v3.0

Add new program mode: dereplicate #20

Closed: mvences closed this issue 3 years ago

mvences commented 3 years ago

I have been thinking about new functions that could be added to TaxI3 to make the program's "skills" applicable to additional tasks that often need to be performed during the analysis of DNA sequences. This issue describes the first of these, and I will create a second issue with a further such addition.

For this new addition, it would be good to create a third "mode" in the TaxI3 GUI. For now, we have "compare all against all" and "compare to reference"; now we would add "dereplicate" (and, as a fourth mode, "decontaminate", which I will describe in a further issue).

[I know the current GUI will become more and more cluttered with different options, but please don't worry about this - as discussed before, in the end Stefanos will add a totally new graphical interface to make this easy to understand and user-friendly]

The task to be performed is relatively simple: our long lists of sequences often contain sequences that are identical or very similar to each other, and we want to "dereplicate" them, that is, keep only one of these repeated sequences. Because TaxI3 now implements the fast alignment-free distance calculation and the "Rust" improvement, it is well suited to such a task.

There are published algorithms and tools that can do such things, but they all have important shortcomings, so I think it will be better (and should be easy) to come up with something from scratch. The program should just run an optimized "compare all against all" comparison and save a new data file corresponding to the input file but with all replicate sequences excluded. It should also output one file with some general statistics (how many sequences were excluded, etc.).
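The task described above could be sketched roughly as follows. This is only a minimal illustration, not the TaxI3 implementation: the function names, the `threshold` parameter, the greedy "keep the first sequence seen" rule and the placeholder distance (fraction of differing positions in equal-length sequences) are all assumptions made for the example. In the real program the alignment-free distance would be passed in instead.

```python
from typing import Callable, Dict, List, Tuple

Record = Tuple[str, str]  # (sequence name, sequence)


def p_distance(a: str, b: str) -> float:
    """Placeholder distance (assumption): fraction of differing positions.

    Only meaningful for sequences of equal length; it stands in for
    whatever distance the program actually provides.
    """
    if len(a) != len(b):
        raise ValueError("placeholder distance needs equal-length sequences")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def dereplicate(
    records: List[Record],
    threshold: float,
    distance: Callable[[str, str], float] = p_distance,
) -> Tuple[List[Record], List[Tuple[str, str]], Dict[str, int]]:
    """Greedy dereplication: keep a sequence only if it is farther than
    `threshold` from every sequence kept so far.

    Returns the kept records, a list of (excluded name, name of the kept
    sequence it replicates), and some general statistics.
    """
    kept: List[Record] = []
    excluded: List[Tuple[str, str]] = []
    for name, seq in records:
        for kept_name, kept_seq in kept:
            if distance(seq, kept_seq) <= threshold:
                excluded.append((name, kept_name))
                break
        else:
            # No kept sequence was close enough, so this one is new.
            kept.append((name, seq))
    stats = {
        "input": len(records),
        "kept": len(kept),
        "excluded": len(excluded),
    }
    return kept, excluded, stats
```

With a threshold of 0.0 only identical sequences are removed; a larger threshold also removes very similar ones. Which member of a group of replicates is kept (here simply the first one in the input) is a design choice that the issue leaves open.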

Output:

necrosovereign commented 3 years ago

I'm not sure how the alignment-free distance corresponds to similarity percentage.

mvences commented 3 years ago

This is a good point; in fact, I also have no idea. Also, I am not 100% sure which of the alignment-free distance calculations of Alfpy you have implemented. So it will be best to test this empirically. Unfortunately, since I am travelling, I cannot run the program myself.

I attach here an aligned fasta file of 100 bp sequences in which the first sequence is a reference, the next two differ from it by 1%, the next two by 2%, the next two by 5%, the next two by 10%, and the last by 20%. Can you please run this as all-against-all using the alignment-free option and send me the results? With this, it should be possible to see roughly how the alignment-free distance corresponds to the normal distance, even if I expect the correlation will not be exact when calculated for complex sequences.
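A test of this kind could be sketched as below. This is a hedged illustration only: the reference sequence is randomly generated rather than taken from the attached file, and the k-mer Jaccard distance is merely a simple stand-in for an alignment-free measure, since it is not known here which calculation TaxI3 uses. All function names and the choice of k = 4 are assumptions.

```python
import random

BASES = "ACGT"


def make_reference(length: int = 100, seed: int = 0) -> str:
    """Random reference sequence (illustrative, not the attached file)."""
    rng = random.Random(seed)
    return "".join(rng.choice(BASES) for _ in range(length))


def mutate(seq: str, n_changes: int) -> str:
    """Substitute n_changes evenly spaced positions with a different base."""
    if n_changes > len(seq):
        raise ValueError("more changes requested than positions available")
    if n_changes == 0:
        return seq
    step = len(seq) // n_changes
    chars = list(seq)
    for i in range(n_changes):
        pos = i * step
        # Cycle to the next base so the position always changes.
        chars[pos] = BASES[(BASES.index(chars[pos]) + 1) % 4]
    return "".join(chars)


def p_distance(a: str, b: str) -> float:
    """Uncorrected distance: fraction of differing aligned positions."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def kmers(seq: str, k: int) -> set:
    """Set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a: str, b: str, k: int = 4) -> float:
    """Jaccard distance between k-mer sets (stand-in alignment-free measure)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def compare(changes=(0, 1, 2, 5, 10, 20)):
    """Rows of (changes per 100 bp, p-distance, k-mer distance) vs. the reference."""
    ref = make_reference()
    rows = []
    for n in changes:
        variant = mutate(ref, n)
        rows.append((n, p_distance(ref, variant), kmer_distance(ref, variant)))
    return rows
```

Calling `compare()` and printing the rows gives a small table from which the relationship between the two distances can be inspected. Any values obtained this way only describe this toy measure and say nothing about the table produced by TaxI3.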

By the way, I cannot exactly remember, has the alignment-free option already been implemented also for the "compare against reference" mode? I think yes but I cannot fully remember...

sequences_distance_test.fas.txt

necrosovereign commented 3 years ago

Yes, the alignment-free is already implemented for all modes.

Here is the alignment-free distance table that has been calculated from your file: sequence_distance_test_table.txt

mvences commented 3 years ago

Thank you for the table. So, with this data I have made the following quick and dirty correlation:

image

As I expected, the correlation is not exact. So we could give the user the following options which should be more or less working:

necrosovereign commented 3 years ago

I've attempted to implement this mode, but the new options don't fit into the window. I think the GUI needs to be reworked before new features can be added.

mvences commented 3 years ago

Yes, it is clear that the GUI is already incredibly full. However, I think it will take some weeks or even months before Stefanos can work on the new GUI since he will start with concatenator. So it will be better not to wait for this before adding the new functions.

Maybe the two following options would be possible:

Either just increase the size of the current tkinter GUI and add the new features in a new, empty part of the enlarged GUI canvas (but only if this can be done without much effort; it is not worth spending a lot of time changing the tkinter GUI).

Or, alternatively, just add three radiobuttons for the new functions, with very short descriptions ("DEREP" for dereplicate mode, "DECONT" for decontamination, "BLAST" for external Blast) and, if necessary to make some space, abbreviate the two existing modes as "ALL" for all-against-all comparisons and "REF" for comparison against a reference database. For the three new modes (DEREP, DECONT, BLAST), just implement the function with default settings and without any further user options in the GUI. This will allow us to run some basic tests of the new functions and will make it easier for Stefanos to integrate them into the new GUI later. Once he designs the new GUI, the user options (like setting similarity percentages) can be added.
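The second option could look roughly like the following sketch. It is not taken from the TaxI3 code: the `MODES` table and the `build_mode_selector` helper are hypothetical names, and only standard tkinter/ttk widgets are used.

```python
# Short labels proposed in this thread, mapped to what each mode does.
MODES = {
    "ALL": "compare all against all",
    "REF": "compare to reference",
    "DEREP": "dereplicate",
    "DECONT": "decontaminate",
    "BLAST": "external Blast",
}


def build_mode_selector(parent, default="ALL"):
    """Build one row of radiobuttons, one per mode (hypothetical helper).

    Returns the frame holding the buttons and the StringVar that stores
    the label of the currently selected mode.
    """
    # Imported here so that MODES can be used without a Tk installation
    # or a display.
    import tkinter as tk
    from tkinter import ttk

    selected = tk.StringVar(master=parent, value=default)
    frame = ttk.Frame(parent)
    for column, label in enumerate(MODES):
        ttk.Radiobutton(
            frame, text=label, value=label, variable=selected
        ).grid(row=0, column=column, padx=2)
    return frame, selected
```

In an existing window this might be used as `frame, mode = build_mode_selector(root)` followed by `frame.grid(...)`, with `mode.get()` read when the user starts a run. Since the new modes would run with default settings, no further widgets are needed for them.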