nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2
https://nf-co.re/ampliseq
MIT License
182 stars, 115 forks

Add MACSE to enable frame-shift detection for COI #758

Open hjarnek opened 3 months ago

hjarnek commented 3 months ago

Description of feature

The current stop-codon detection in ampliseq could be improved and supplemented with frame-shift detection by implementing MACSE. There are existing Nextflow pipelines here. This is a step that is increasingly recommended for protein-coding marker genes.

d4straub commented 3 months ago

Hi, thanks for the reference to MACSE. However, I do not understand how that tool could fit into the pipeline. We do not produce alignments. Could you elaborate?

hjarnek commented 3 months ago

@d4straub MACSE is an aligner with many applications, one of which is to detect pseudogenes among protein-coding marker genes.

Excerpted from the paper:

The enrichAlignment subprogram can be used to sequentially add new DNA sequences to an existing alignment. Its input parameters allow defining criteria that the additional sequences should fulfil to be actually incorporated into the final alignment. For instance, sequences can be automatically discarded when, once aligned, they would contain a stop codon, too many gaps, or more than a given number of frameshifts. [...] This [...] is especially useful for metabarcoding projects based on markers such as the mitochondrial Cytochrome Oxidase subunit I (cox1) gene. This typically involves enriching a reference alignment containing sequences from databases such as BOLD or MIDORI with thousands of newly generated sequences.
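The acceptance criteria described in the excerpt can be sketched as a simple threshold check. This is only an illustration of the idea, not MACSE's implementation or interface: the `AlignedSeqStats` record, the `keep_sequence` function, and the default thresholds are all hypothetical, and the per-sequence counts are assumed to have been obtained from the aligner beforehand.

```python
from dataclasses import dataclass


@dataclass
class AlignedSeqStats:
    """Hypothetical per-sequence summary after alignment to a reference."""
    seq_id: str
    stop_codons: int      # in-frame stop codons once aligned
    frameshifts: int      # frameshifts needed to align the sequence
    gap_fraction: float   # share of alignment columns that are gaps


def keep_sequence(stats, max_stops=0, max_frameshifts=0, max_gap_fraction=0.2):
    """Return True if the sequence meets all (assumed) acceptance thresholds."""
    return (
        stats.stop_codons <= max_stops
        and stats.frameshifts <= max_frameshifts
        and stats.gap_fraction <= max_gap_fraction
    )


# Made-up example values, purely for illustration.
candidates = [
    AlignedSeqStats("asv1", stop_codons=0, frameshifts=0, gap_fraction=0.05),
    AlignedSeqStats("asv2", stop_codons=1, frameshifts=0, gap_fraction=0.02),
    AlignedSeqStats("asv3", stop_codons=0, frameshifts=2, gap_fraction=0.10),
]
kept = [s.seq_id for s in candidates if keep_sequence(s)]
```

With these made-up values only `asv1` passes; `asv2` is rejected for its stop codon and `asv3` for its frameshifts.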

Creating those reference alignments they talk about is already done (for COI, rbcL & matK). They are available for different genetic codes and taxonomic groups from here: https://www.agap-ge2pop.org/barcoding-alignments/

This approach would improve on ampliseq's stop codon filtering by automatically detecting the correct ORF, taking more genetic translation tables into account, and detecting putative nuclear mitochondrial pseudogenes (nuMTs) through frameshift and gap analysis. Quite a desirable upgrade for COI analyses.
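To illustrate why the reading frame and translation table matter, here is a minimal sketch that scans all three forward frames for stop codons under a few NCBI translation tables. It is not how MACSE or ampliseq works internally: the function names are hypothetical, only three tables are included, and unlike an alignment-based approach it cannot detect frameshifts.

```python
# Stop codons for a small subset of NCBI translation tables.
STOP_CODONS = {
    1: {"TAA", "TAG", "TGA"},         # standard code
    2: {"TAA", "TAG", "AGA", "AGG"},  # vertebrate mitochondrial
    5: {"TAA", "TAG"},                # invertebrate mitochondrial
}


def count_stops(seq, frame, table):
    """Count stop codons in `seq` read from offset `frame` (0, 1 or 2)."""
    stops = STOP_CODONS[table]
    seq = seq.upper()
    # Step through complete codons only; a trailing partial codon is ignored.
    return sum(
        1 for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in stops
    )


def open_frames(seq, table):
    """Return the forward frames that contain no stop codon."""
    return [f for f in range(3) if count_stops(seq, f, table) == 0]


# Toy sequence: TGA is a stop in table 1 but codes for Trp in tables 2 and 5.
example = "ATGTGAGGA"
frames_standard = open_frames(example, 1)
frames_invert_mito = open_frames(example, 5)
```

For the toy sequence, frame 0 is rejected under the standard code but accepted under the invertebrate mitochondrial code, so choosing the wrong table would wrongly discard the sequence.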

d4straub commented 3 months ago

Thanks for the reply! Here are 3 more questions:

hjarnek commented 3 months ago

@d4straub

d4straub commented 3 months ago

Thanks! Because I am also limited in time and currently have no projects involving protein-coding marker genes, I cannot justify investing time in adding MACSE right now. As I said above, I could give advice and reviews if anyone wants to give it a shot.

hjarnek commented 2 months ago

Alright, I see. Let's keep it hanging for now then.