This would be one further program mode in addition to "Compare all against all", "Compare against reference", and "Dereplicate".
The "Decontaminate" mode would be a variant of the "Compare against reference" mode in which matches are removed from the input file.
The typical application for this mode is a set of sequences from a certain group of organisms that erroneously includes sequences of other, non-target organisms. For example, one has sequenced many frog samples but suspects that some of the sequences come not from the frogs themselves but from pathogens or parasites of the frogs.
The program would then compare all sequences in the input file with a set of reference sequences, and remove all the matches.
The general procedure would be very similar to the "Compare against reference" mode.
Sequences from the input file that match one of the reference sequences are removed, and placed in a second file.
The "filtered" input file (with "contaminants" removed) is saved automatically as a new file, named like the original input file but with "_decontaminated" added to the file name.
The sequences considered as contaminants are saved to a second file, named as the original input file but with "_contaminants" added to the file name.
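The naming rule for the two sequence files could be sketched as follows. This is only an illustration: the function name is hypothetical, and placing the added text before the file extension (rather than after it) is an assumption that the description above does not settle.

```python
from pathlib import Path


def decontaminate_output_paths(input_path):
    """Return (decontaminated, contaminants) paths in the input file's folder.

    Assumption: the added text goes before the extension,
    e.g. frogs.fasta -> frogs_decontaminated.fasta.
    """
    p = Path(input_path)
    decontaminated = p.with_name(f"{p.stem}_decontaminated{p.suffix}")
    contaminants = p.with_name(f"{p.stem}_contaminants{p.suffix}")
    return decontaminated, contaminants


kept_path, removed_path = decontaminate_output_paths("data/frogs.fasta")
print(kept_path.name)     # frogs_decontaminated.fasta
print(removed_path.name)  # frogs_contaminants.fasta
```

Because both paths are derived with `with_name`, the output files end up in the same folder as the input file.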
A further output file is produced that is very similar to the regular output file of the "Compare against reference" mode but marks in an additional field (column) whether a sequence has been recognized as a "contaminant" and excluded.
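As a rough sketch of such a report, the snippet below writes one tab-separated row per input sequence with an extra yes/no column. The column names, the tab-separated layout, and the function name are all assumptions made for illustration; the actual layout of the regular output file is not described here.

```python
import csv
import io


def write_report(rows, handle, threshold=0.99):
    """Write (query, reference, similarity) rows plus a contaminant flag.

    Hypothetical column layout; similarity is a fraction between 0 and 1.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["query", "closest_reference", "similarity", "contaminant"])
    for query, reference, similarity in rows:
        flag = "yes" if similarity >= threshold else "no"
        writer.writerow([query, reference, f"{similarity:.4f}", flag])


# In-memory demonstration; real use would pass an open file handle.
report = io.StringIO()
write_report([("frog_01", "fungus_A", 0.995), ("frog_02", "fungus_A", 0.5)], report)
print(report.getvalue())
```

Rows are written one at a time, so the report never has to be held in memory as a whole.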
As with the "Dereplicate" mode, these output files should be produced in a way that lets the program deal with massive files, possibly several GB in size. This means it is better not to write them to a temporary folder and preview them in the GUI (this option should be disabled), but to save all output files directly to the folder where the input file is located.
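One way to keep memory use independent of file size is to read the input one record at a time and write each record immediately to one of the two output files. The sketch below assumes FASTA input, which is not stated above, and uses hypothetical function names; the `is_contaminant` callback stands in for whatever comparison the program actually performs.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def split_stream(records, is_contaminant, kept_out, removed_out):
    """Write each record to one of two open handles; return (kept, removed)."""
    kept = removed = 0
    for header, seq in records:
        if is_contaminant(seq):
            removed_out.write(f">{header}\n{seq}\n")
            removed += 1
        else:
            kept_out.write(f">{header}\n{seq}\n")
            kept += 1
    return kept, removed


# In-memory demonstration; real use would pass open file handles.
source = io.StringIO(">frog1\nACGT\nAC\n>fungus1\nTTTT\n")
kept_file, removed_file = io.StringIO(), io.StringIO()
counts = split_stream(read_fasta(source), lambda seq: seq == "TTTT",
                      kept_file, removed_file)
print(counts)  # (1, 1)
```

Only the current record is held in memory, so the same loop would apply to inputs of several GB, provided the reference set itself fits in memory.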
User setting options:
As with the "Dereplicate" mode, the user should be able to select the similarity threshold at which a sequence is considered a contaminant and removed. I would suggest a default of 99% similarity.
By default the program would run this mode with the fast alignment-free comparison, but we should also allow the option of using pairwise alignments (which is one of the novelties and strengths of TaxI).
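The two settings above could fit together roughly as in the sketch below: a threshold with a default of 0.99 and an interchangeable similarity function. The k-mer Jaccard measure used here is only a simple stand-in for an alignment-free comparison and is not claimed to be the method TaxI uses; all names are hypothetical, and treating a score equal to the threshold as a match is an assumption.

```python
def kmer_set(seq, k=4):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_similarity(a, b, k=4):
    """Jaccard similarity of k-mer sets (stand-in alignment-free measure)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def is_contaminant(seq, references, threshold=0.99, similarity=kmer_similarity):
    """True if seq reaches the threshold against any reference sequence.

    A pairwise-alignment scorer could be passed as `similarity` instead.
    """
    return any(similarity(seq, ref) >= threshold for ref in references)


print(is_contaminant("ACGTACGTAA", ["ACGTACGTAA"]))  # True
print(is_contaminant("ACGTACGTAA", ["TTTTTTTTTT"]))  # False
```

Passing the similarity function as a parameter keeps the threshold logic identical for the alignment-free default and the optional pairwise alignments.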
Output files:
One file with the sequences from the main input file, minus all the sequences removed as "contamination".