nextgenusfs / amptk

AMPtk: Amplicon ToolKit for NGS data (formally UFITS)
http://amptk.readthedocs.io/
BSD 2-Clause "Simplified" License
38 stars 14 forks source link

Inconsistent primer trimming sequence in amptk_mock*.fa #89

Closed peterjc closed 2 years ago

peterjc commented 2 years ago

As per #87, it is clear amptk_mock1.fa has been processed differently from amptk_mock2.fa and amptk_mock3.fa. It has no line wrapping, no species names, but does seem to have been left primer trimmed.

Consider mock7:

$ grep -A 1 mock7 amptk_mock1.fa
>mock7
AACGCACCTTGCGCTCCTTGGCATTCCGAGGAGCATGCCTGTTTGAGTGTCATGAAACTCTCAACCGAGCTTCGGTTAGATTTTGGAGGTTGCCGGACCGGGTTCGGCTCCCCTTGAATGCATAGTAAACCTTTGGACTCTGACCGACCGAGTGGACGTGATAGAAAAGTCACCGTCGACTGAAGGGTCCGTCGCTTGAACGGTTCAAAGCCTTGCTTCGTCGTCCTCGGACGAAAGCTTTATCTCTGACCTCAAATCAGGCAGGACTACCCGCTGAACTTAA

$ grep -A 7 mock7 amptk_mock2.fa
>mock7 |GDL-1|Laetiporus caribensis
GAAATGCGATAAGTAATGTGAATTGCAGAATCCAGTGAATCATCGAATCT
TTGAACGCACCTTGCGCTCCTTGGCATTCCGAGGAGCATGCCTGTTTGAG
TGTCATGAAACTCTCAACCGAGCTTCGGTTAGATTTTGGAGGTTGCCGGA
CCGGGTTCGGCTCCCCTTGAATGCATAGTAAACCTTTGGACTCTGACCGA
CCGAGTGGACGTGATAGAAAAGTCACCGTCGACTGAAGGGTCCGTCGCTT
GAACGGTTCAAAGCCTTGCTTCGTCGTCCTCGGACGAAAGCTTTATCTCT
GACCTCAAATCAGGCAGGACTACCCGCTGAACTTA

$ grep -A 7 mock7 amptk_mock3.fa
>mock7 |GDL-1|Laetiporus caribensis
GAAATGCGATAAGTAATGTGAATTGCAGAATCCAGTGAATCATCGAATCT
TTGAACGCACCTTGCGCTCCTTGGCATTCCGAGGAGCATGCCTGTTTGAG
TGTCATGAAACTCTCAACCGAGCTTCGGTTAGATTTTGGAGGTTGCCGGA
CCGGGTTCGGCTCCCCTTGAATGCATAGTAAACCTTTGGACTCTGACCGA
CCGAGTGGACGTGATAGAAAAGTCACCGTCGACTGAAGGGTCCGTCGCTT
GAACGGTTCAAAGCCTTGCTTCGTCGTCCTCGGACGAAAGCTTTATCTCT
GACCTCAAATCAGGCAGGACTACCCGCTGAACTTA

Curiously only amptk_mock1.fa has been left primer trimmed, but has an extra trailing A. Or looking at the other sequences, perhaps amptk_mock2.fa and amptk_mock3.fa are missing the second final A.

This can be shown by aligning the three sequences, here with muscle:

$ muscle -quiet -in mock7.fa
>amptk_mock1.fa-mock7
-----------------------------------------------------AACGCAC
CTTGCGCTCCTTGGCATTCCGAGGAGCATGCCTGTTTGAGTGTCATGAAACTCTCAACCG
AGCTTCGGTTAGATTTTGGAGGTTGCCGGACCGGGTTCGGCTCCCCTTGAATGCATAGTA
AACCTTTGGACTCTGACCGACCGAGTGGACGTGATAGAAAAGTCACCGTCGACTGAAGGG
TCCGTCGCTTGAACGGTTCAAAGCCTTGCTTCGTCGTCCTCGGACGAAAGCTTTATCTCT
GACCTCAAATCAGGCAGGACTACCCGCTGAACTTAA
>amptk_mock2.fa-mock7 |GDL-1|Laetiporus caribensis
GAAATGCGATAAGTAATGTGAATTGCAGAATCCAGTGAATCATCGAATCTTTGAACGCAC
CTTGCGCTCCTTGGCATTCCGAGGAGCATGCCTGTTTGAGTGTCATGAAACTCTCAACCG
AGCTTCGGTTAGATTTTGGAGGTTGCCGGACCGGGTTCGGCTCCCCTTGAATGCATAGTA
AACCTTTGGACTCTGACCGACCGAGTGGACGTGATAGAAAAGTCACCGTCGACTGAAGGG
TCCGTCGCTTGAACGGTTCAAAGCCTTGCTTCGTCGTCCTCGGACGAAAGCTTTATCTCT
GACCTCAAATCAGGCAGGACTACCCGCTGAACTTA-
>amptk_mock3.fa-mock7 |GDL-1|Laetiporus caribensis
GAAATGCGATAAGTAATGTGAATTGCAGAATCCAGTGAATCATCGAATCTTTGAACGCAC
CTTGCGCTCCTTGGCATTCCGAGGAGCATGCCTGTTTGAGTGTCATGAAACTCTCAACCG
AGCTTCGGTTAGATTTTGGAGGTTGCCGGACCGGGTTCGGCTCCCCTTGAATGCATAGTA
AACCTTTGGACTCTGACCGACCGAGTGGACGTGATAGAAAAGTCACCGTCGACTGAAGGG
TCCGTCGCTTGAACGGTTCAAAGCCTTGCTTCGTCGTCCTCGGACGAAAGCTTTATCTCT
GACCTCAAATCAGGCAGGACTACCCGCTGAACTTA-

The same is true of mock23.

The situation for mock22 is more curious with a Y ambiguity code in amptk_mock1.fa versus an N in amptk_mock2.fa and amptk_mock3.fa.

nextgenusfs commented 2 years ago

I will have to look back and get confirmation, but my recollection was that these sequences were from Sanger sequencing of the actual clones -- where we might have used a different forward sequencing primer and/or the Sanger data wasn't good. Surely that is where the Y came from (ie Sanger data was inconclusive at that position).

peterjc commented 2 years ago

Yes, the ambiguity is undoubtedly from Sanger capillary sequencing - I was puzzled that it was Y in one file, but the more vague N in another.

My main query was about the trimming being different between the three files under https://github.com/nextgenusfs/amptk/tree/master/amptk/DB - but I may not have understood why there are three files?

$ grep -c "^>" amptk_mock*.fa
amptk_mock1.fa:20
amptk_mock2.fa:26
amptk_mock3.fa:23

Given they have differing numbers of sequences, my assumption was three different mock communities. Quoting the paper,

To construct the biological mock community (BioMock) we selected 26 identified fungal cultures (Table S1) ...

and:

The BioMock-standards consist of an equimolar mixture of 26 PCR products thereby removing the PCR bias from mixed DNA samples, while the BioMock communities consist of an equimolar mixture of 23 single-copy plasmids.

That suggests amptk_mock2.fa is the BioMock-standards (equimolar mixture of 26 PCR products), and amptk_mock3.fa is the BioMock communities (equimolar mixture of 23 single-copy plasmids), leaving amptk_mock1.fa unexplained - perhaps a precursor dataset?

nextgenusfs commented 2 years ago

Hi @peterjc. Yes I think you have it figured out. I had used this to just keep track of what we were running when were were doing these experiments and writing this pipeline. amptk_mock1.fa was our first control sample we ran on Ion Torrent and once we saw results we added a few more species to capture some more variability and make sure we understood what was going on (It actually pre-dates the data in the publication, so at this point it probably shouldn't be in the repo other than for legacy/historical reasons).

So the numerical suffix is the iterations of the different "biological mock" communities that we ran. We had included a few clones in biomock2 that consistently did not sequence well on Ion Torrent for different reasons (mind you this is before we developed the synthetic mock) we then attempted to pare down to the 23 that had enough sequence divergence and give us some confidence in the data analysis pipelines.

The more times we ran this the more we got the idea that the biological sequences are problematic due to index-bleed, hence synthetic mocks as the final solution. I still recommend people run the synthetic mock along with a biological mock community that is representative of species/ITS sequences they are looking to differentiate. The other practical advice when using Illumina is just never re-use any of the index/barcodes and then "index-bleed" is actually quite low on MiSeq platform. I don't think many folks are using Ion Torrent anymore, so practical advice would be to buy the MiSeq instead of the Ion Torrent :).

peterjc commented 2 years ago

That makes sense - as much as we try to document as we go along, there will always be hidden assumptions it takes fresh eyes to see.

I found your work looking for other people doing metabarcoding with synthetic controls. Our paper plans are all delayed though lockdown, but many of our methodical conclusions agree (we're only using the MiSeq). Expect a citation to follow...

So, as to this issue (#89), amptk_mock1.fa was used in pilot data before the work described on the paper, so we shouldn't read too much into it differing from amptk_mock2.fa and amptk_mock3.fa. I will close this issue now, thank you.

nextgenusfs commented 2 years ago

Cool, looking forward to the publication! It took me awhile to remember the different mocks that we tried -- most of this work/sequencing was done in 2014-2016 time frame so hard to recall all the details!