tjcreedy / metamate

Your metabarcoding friend! Filter erroneous and unwanted amplicons.
GNU General Public License v3.0

Issue with translation - problematic inference of indels at beginning of sequences #6

Open naurasd opened 2 years ago

naurasd commented 2 years ago

Hi @tjcreedy

I have a dataset of ~28,000 non-singleton ASVs (311-315 bp in length) in 72 libraries, derived from amplicon sequencing of a 313 bp fragment of COI. The ASVs were generated with the dada2 pipeline for filtering, denoising and chimera removal. When I run metamate, a lot of highly prevalent ASVs are identified as verified non-authentic based on translation, including my most prevalent ASV (based on read abundance).

I am running metamate find @args.txt

where args.txt looks as follows:

-A COI_nochim_nosingle_ASVs.fa -M ASV_counts_nosingle.txt -S specifications.txt -R midori_coi_gc5.fasta --table 5 -n 311 -x 315 -o outputdir --overwrite

In specifications.txt, a single parameter of [library; n; 3-10/8] is defined.

The reference fasta is the MIDORI COI reference set subset to the phyla Arthropoda, Nematoda and Mollusca, i.e. taxa for which NCBI translation table 5 is suited.

3908 ASVs were identified as verified non-authentic based on translation. However, I had my doubts, seeing that my most prevalent ASV (based on read abundance) and some other highly prevalent ones were included there.

So I took a closer look at ASV1 (the most prevalent one, 312 bp in length; see the attached file asv1.txt). I ran a BLAST search of this sequence against the NCBI and BOLD databases. In BOLD, I got 4 hits of 100% similarity with sequences of the sponge Hemimycale arabica. In NCBI, I got one hit with 100% query cover and 98.7% identity (Porifera sp. voucher) and three hits with 78% query cover and 100% identity (Hemimycale arabica). So I was pretty sure I had a valid, non-NUMT sequence there. I thought ASV1 might have been identified as non-authentic because I used translation table 5 instead of 4 (it seems the latter is suited for Porifera), so I ran metamate again, this time with translation table 4 and a MIDORI reference set subset to taxa suited to this table. But exactly the same ASVs were identified as non-authentic based on translation again.
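
As a side note on the table 4 vs. table 5 comparison: both NCBI tables use only TAA and TAG as stop codons (TGA codes for Trp in both), and they differ in how AGA/AGG and ATA are translated. A stop-codon check alone would therefore give the same result under either table. The following is a minimal Python sketch of such a check, not metamate's actual implementation; the function name `stop_free_frames` is hypothetical and only exists for this illustration.

```python
# Stop codons shared by NCBI translation tables 4 and 5.
STOPS = {"TAA", "TAG"}

def stop_free_frames(seq):
    """Return the forward frame offsets (0, 1, 2) that contain no stop codon.

    Hypothetical helper for illustration only; this is not how metamate
    is known to perform its translation check.
    """
    seq = seq.upper()
    frames = []
    for offset in range(3):
        # Split into complete codons, ignoring a trailing partial codon.
        codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        if not any(codon in STOPS for codon in codons):
            frames.append(offset)
    return frames
```

For example, `stop_free_frames("ATGTAAGGC")` excludes frame 0 because of the TAA codon, while a sequence containing an in-frame TGA is not excluded under either table.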

So, as a test, I aligned ASV1 to ASV39 (see attached test.txt) using MACSE with translation table 5 (see the attached NT and AA files, test_NT.txt and test_AA.txt). These ASVs are 311-313 bp long. As you can see, there are no internal stop codons or indels. Because 313 is not a multiple of 3, two frameshifts are identified at the nucleotide level at the beginning of the 313 bp sequences (marked as "!"). Hence, these sequences show a frameshift at the amino acid level at the beginning of the sequence, but metamate does not identify them as non-authentic. ASVs with a length of 312 bp are aligned with a frame-preserving gap at the beginning, and ASVs with a length of 311 bp with a frame-preserving gap and a single-nucleotide frameshift at the beginning. All of the ASVs with a frame-preserving gap at the beginning are identified by metamate as non-authentic. I assume this is because these gaps appear as deletions in the amino acid alignment and are therefore counted as indels.
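
The pattern above can be reproduced with simple arithmetic. The sketch below assumes that the reading frame is anchored at the 3' end of the 313 bp fragment and that shorter ASVs are only missing bases at the 5' end; this is my reading of the MACSE alignment, not something I have confirmed. The function name `leading_columns` and its output labels are made up for this illustration.

```python
FULL_LENGTH = 313  # length of the targeted COI fragment

def leading_columns(length, full_length=FULL_LENGTH, n_columns=2):
    """Describe the first codon columns of an alignment anchored at the 3' end.

    Assumption (not verified against MACSE internals): sequences shorter
    than full_length lack bases only at their 5' end.
    """
    missing = full_length - length
    if missing < 0:
        raise ValueError("sequence is longer than the full fragment")
    lead = full_length % 3  # bases in the partial first codon (1 for 313 bp)
    # Bases that a full-length sequence contributes to each leading column.
    widths = (([lead] if lead else []) + [3] * n_columns)[:n_columns]
    columns = []
    for width in widths:
        lost = min(missing, width)
        missing -= lost
        present = width - lost
        if present == 0:
            columns.append("gap")  # shown as "---", frame-preserving
        elif present == 3:
            columns.append("codon")  # complete codon
        else:
            # Partial codon: padded with (3 - present) "!" characters.
            columns.append(f"frameshift({3 - present})")
    return columns
```

Under these assumptions, a 313 bp ASV gives `['frameshift(2)', 'codon']`, a 312 bp ASV gives `['gap', 'codon']`, and a 311 bp ASV gives `['gap', 'frameshift(1)']`, which matches what I see in the alignment.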

Is there anything I can change in the settings to ignore these indels (or stop codons) at the beginning of sequences? Because I don't think these ASVs should be classified as non-authentic.

Appreciate your feedback.

Nauras