iTaxoTools / TaxI2

Calculation and analysis of pairwise sequence distances
GNU General Public License v3.0

Add additional mode: DECONT2 #31

Closed mvences closed 2 years ago

mvences commented 2 years ago

In this mode, the program works with one query and two reference data sets. In some previous emails I erroneously designated it as "DEREP2", but the correct designation is DECONT2, as the idea is to perform a decontamination based on a double comparison.

In brief, the program compares each query sequence in turn with all sequences of reference dataset 1, and then with all sequences of reference dataset 2. It removes those query sequences that find a stronger match (smaller distance) in reference dataset 1 than in reference dataset 2.

In the current tkinter GUI, this extra option should be added as DECONT2 with an additional radiobutton, even if that means users need to increase the size of the GUI a bit to see and access all the radiobuttons and read all the text. The GUI also needs a third option to upload a file: currently users can upload a query file and a reference file, so we need a third button/field to upload a second reference file.

In this mode, the program again takes the sequences from the query sequence file one after another. For a given query sequence, it performs a comparison with all sequences in reference file 1 and stores the score of the best match found. Then, for the same query sequence, it performs a comparison with all sequences in reference file 2 and again stores the score of the best match.

By default, the program does this based on alignment-free distances, so "match score" means the alignment-free distance found by the program. The user can also select classical alignment; in this case, the "match scores" are genetic distances.

Then the two best matches are compared. We will define reference file 1 as the "outgroup" and reference file 2 as the "ingroup". If a sequence has a closer match (smaller distance) to the outgroup than to the ingroup, it is considered to be a contamination and is moved to the "excluded sequences" output file. If a sequence has a closer match to the ingroup than to the outgroup, it is kept and becomes part of the "included sequences" output file.

As with DECONT and DEREP, the output files will automatically be written to disk, because the query files can potentially be very large and it is not realistic to keep the potentially huge output in memory.
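To make the decision rule concrete, here is a minimal Python sketch of the comparison described above. It is only an illustration, not the TaxI2 implementation: the names `kmer_distance`, `best_match` and `decont2` are hypothetical, the k-mer Jaccard distance is a toy stand-in for whatever alignment-free distance the program actually uses, and the tab-separated output format is invented for the example. The description does not say what happens when both best distances are equal; the sketch assumes such sequences are kept.

```python
from typing import Callable, Iterable, List, TextIO, Tuple

Record = Tuple[str, str]  # (sequence name, sequence); hypothetical record type


def kmer_distance(a: str, b: str, k: int = 3) -> float:
    """Toy alignment-free distance: 1 - Jaccard similarity of the k-mer sets.

    Assumption: a stand-in only, not the distance TaxI2 computes.
    """
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = kmers_a | kmers_b
    if not union:
        return 1.0
    return 1.0 - len(kmers_a & kmers_b) / len(union)


def best_match(query: str, references: List[Record],
               distance: Callable[[str, str], float]) -> float:
    """Smallest distance between the query and any reference sequence."""
    return min((distance(query, seq) for _, seq in references),
               default=float("inf"))


def decont2(queries: Iterable[Record],
            outgroup: Iterable[Record],
            ingroup: Iterable[Record],
            included_out: TextIO,
            excluded_out: TextIO,
            distance: Callable[[str, str], float] = kmer_distance,
            ) -> Tuple[int, int]:
    """Stream queries one by one and write each result immediately.

    Reference file 1 is the outgroup, reference file 2 the ingroup.
    Returns (number included, number excluded).
    """
    outgroup = list(outgroup)  # references are held in memory in this sketch
    ingroup = list(ingroup)
    n_included = n_excluded = 0
    for name, seq in queries:  # queries are never collected into a list
        d_out = best_match(seq, outgroup, distance)
        d_in = best_match(seq, ingroup, distance)
        if d_out < d_in:
            # Closer to the outgroup: treated as contamination.
            excluded_out.write(f"{name}\t{seq}\n")
            n_excluded += 1
        else:
            # Closer to the ingroup (or a tie, by assumption): kept.
            included_out.write(f"{name}\t{seq}\n")
            n_included += 1
    return n_included, n_excluded
```

In real use, `included_out` and `excluded_out` would be files opened for writing and `queries` a generator reading the query file, so only one query sequence is in memory at a time.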