Open hjarnek opened 3 months ago
Hi, thanks for the reference to MACSE. However, I do not understand how that tool could fit into the pipeline. We do not produce alignments. Could you elaborate?
@d4straub MACSE is an aligner with many applications, one of which is to detect pseudogenes among protein-coding marker genes.
Excerpted from the paper:
The enrichAlignment subprogram can be used to sequentially add new DNA sequences to an existing alignment. Its input parameters allow defining criteria that the additional sequences should fulfil to be actually incorporated into the final alignment. For instance, sequences can be automatically discarded when, once aligned, they would contain a stop codon, too many gaps, or more than a given number of frameshifts. [...] This [...] is especially useful for metabarcoding projects based on markers such as the mitochondrial Cytochrome Oxidase subunit I (cox1) gene. This typically involves enriching a reference alignment containing sequences from databases such as BOLD or MIDORI with thousands of newly generated sequences.
The reference alignments they mention have already been built (for COI, rbcL & matK). They are available for different genetic codes and taxonomic groups here: https://www.agap-ge2pop.org/barcoding-alignments/
This approach would improve on ampliseq's stop codon filtering by automatically detecting the correct ORF, taking more genetic translation tables into account, and detecting putative nuclear mitochondrial pseudogenes (nuMTs) through frameshift and gap analysis. That would be quite a desirable upgrade for COI analyses.
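To illustrate why the translation table and reading frame matter for stop codon filtering, here is a minimal Python sketch. It is not MACSE's algorithm and not ampliseq's implementation; the function name `stop_free_frames` and the toy sequence are made up for illustration, and only three NCBI translation tables are included. Frameshift and gap detection are not covered, since they need an alignment against a reference, which is what MACSE provides.

```python
# Stop codons for a few NCBI translation tables (subset, for illustration only).
STOP_CODONS = {
    1: {"TAA", "TAG", "TGA"},         # standard code
    2: {"TAA", "TAG", "AGA", "AGG"},  # vertebrate mitochondrial
    5: {"TAA", "TAG"},                # invertebrate mitochondrial
}


def stop_free_frames(seq, table=5):
    """Return the forward reading frames (0, 1, 2) of seq that contain no stop codon.

    Hypothetical helper: it only scans complete codons on the forward strand.
    """
    seq = seq.upper()
    stops = STOP_CODONS[table]
    frames = []
    for frame in range(3):
        # Walk the sequence codon by codon, ignoring a trailing partial codon.
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        if not any(codon in stops for codon in codons):
            frames.append(frame)
    return frames


# Toy sequence: TGA in frame 0 is a stop only under the standard code,
# AGG in frame 2 is a stop only under the vertebrate mitochondrial code.
toy = "ATGTGAGGA"
for table in (1, 2, 5):
    print(table, stop_free_frames(toy, table))
```

The same read can pass or fail the filter depending on the table, which is why a fixed frame and a single genetic code can misclassify sequences.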
Thanks for the reply! Here are 3 more questions:
@d4straub
Thanks! Because I am also limited in time and currently have no projects involving protein-coding marker genes, I cannot justify investing time in adding MACSE right now. As I said above, I could give advice and reviews if anyone wants to give it a shot.
Alright, I see. Let's keep it hanging for now then.
Description of feature
The current stop-codon detection in ampliseq could be improved and supplemented with frame-shift detection by implementing MACSE. There are existing Nextflow pipelines here. This is a step that is increasingly recommended for protein-coding marker genes.