hildebra / lotus2

Amplicon sequencing pipelines suitable for SSU (16S, 18S), LSU (23S, 28S) and ITS.
http://lotus2.earlham.ac.uk/
GNU General Public License v3.0
52 stars 17 forks source link

Which pipeline results should I trust ? LotuS2 - DADA2 or stand alone DADA2 output? #49

Closed Balaveer closed 1 year ago

Balaveer commented 1 year ago

Hi @hildebra,

Thank you and your team for developing the LotuS2 pipeline. Recently, I came across this pipeline as it has many options for clustering and denoising algorithms. I am interested in DADA2 denoising algorithm. I read the manuscript and thought of an interesting conceptual question. It is about the midQual reads after the quality filtering will be used for 'Backmapping onto ASVs'. My question is, doesn't it inflate the abundances of the ASVs which might be PCR or sequencing errors ? Further, this will affect the downstream microbial ecological metrics. Two of our collaborators used DADA2 denoising on the same data but with different pipelines, one with stand alone DADA2 and other group with LotuS2. The results are different no. of sequences, ASVs and finally different diversity estimates. Some of us are confused, which pipeline to go for and actually which one is correct ? I would appreciate your reply and suggestions. Thank you!

Best reagards, BalaVeera

hildebra commented 1 year ago

Hey BalaVeera,

the difference in number of ASVs is not related to the backmapping. This is related to how we quality filter data in my experience. The backmapping usually doesn't affect much of the overall data, and it is applied to reads that fail the quality scores - chimeric / non-chimeric reads is independent of this score and will be filtered out later at the ASV level (removing the complete ASV that seems chimeric). Backmapping is capped at 99% identity for ASV backmapping, these reads are virtually identical to the ASV sequence. So I don't see a theoretical reason for why this shouldn't work.

(https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-022-01365-1) E.g. if you look at Fig. 2-3 you will see that the DADA2 vs Lotus2-DADA2 results are very different. Here we compared the vanilla dada2 to the lotus2 implementation (to clarify: in both cases the dada2 algorithm is the same, but in LotuS2-DADA2 we do the prefiltering in LotuS2, and we might use other dada2 parameters than your colleagues). Figs.2-3 show that LotuS2-DADA2 does outperform vanilla DADA2. Fig 4 tells the same story, but this time using a mock community where we do know community composition. What was most interesting to me personally in this comparison was however the alpha diversity predicted in each pipeline. To cite the paper:

"The mock community consisted of 49 bacteria and 10 archaea [53], with a total of 128 16S rRNA gene copies included in their genomes. If multiple 16S copies occur within a single genome, these can diverge but are mostly highly similar or even identical to each other [57]. Thus, the expected biodiversity would be 59 OTUs and ≤ 128 ASVs. Notably, the number of mothur and QIIME 2-Deblur TP ASVs/OTUs exceeded this threshold (N = 370, 198, respectively), indicating that both pipelines overestimate known biodiversity. DADA2, QIIME 2-DADA2, and PipeCraft 2-VSEARCH generated more ASVs than expected per species (N = 94, 122, and 90 respectively), but this might be explained by divergent within-genome 16S rRNA gene copies. LotuS2 was notably at the lower end in predicted biodiversity, predicting between 53 and 61 OTUs or ASVs in different clustering algorithms (Supplementary Table S4). However, these seemed to mostly represent single species, covering the present species best among pipelines, ...". I found it astonishing how much most pipeline were overpredicting actual diversity.

So in the end I think it's a call everyone has to make themselves. Of course I would say that as pipeline developer, but for me personally, I'm really convinced that LotuS2 does give more accurate results.

hth, Falk

Balaveer commented 1 year ago

Thank you very much for your quick and detailed response, @hildebra . I greatly appreciate it! I haven't used the LotuS2 pipeline so far, and I'm considering testing the pipelines on a subset of the data to see where the exact differences arise. I've looked at the LotuS2 documentation and have a couple of questions.

Our data contains MiSeq PE reads, but both R1 and R2 reads have a mixed orientation of sequencing reads (some starting [primer] with 5′-to-3′, others with 3′-to-5′ direction). I assume the SDM options "CheckForMixedPairs" and "CheckForReversedSeqs" can handle this, correct? Or is there a different approach recommended?

My second question is more of a request for your opinion. If I choose the dada2 clustering option in the LotuS2 pipeline, then after denoising the Fwd and Rev reads, the pairs will be merged. However, what if I merge the Fwd and Rev reads before denoising and then use the merged reads (as single-end reads) as input for ASV clustering? Would you recommend this approach? Would you expect significant differences?

Thanks, BalaVeer

hildebra commented 1 year ago

Hey BalaVeer, Glad I could help! yes the CheckForMixedPairs was programmed in for that reason. about the merging of reads: there is good empirical evidence that the read error profile substantially increases in merged reads; hence we only use read1 for OTU/ASV clustering, because you really, really want to avoid any error sources, otherwise your alpha div will go unreasonably up. However, merged reads are really useful for assigning tax info, hence we merge after we determine the number of reads. It's a bit counterintuitive but a major sources of errors (using both read pairs for OTU/ASV clustering). You can also read about this in the Uparse paper from Robert Edgar. hth, Falk

Balaveer commented 1 year ago

Thanks for your quick reply @hildebra. Regarding the merging PE reads, since the length of the merged sequence will be longer than R1 / R2 reads, the probability for error profile will increase - this I can understand. But, if one use only read1 for ASV clustering then obviously they would lose all the sequence variation in the read2, isn't it ? This would then under represent the biological variation, especially for the amplicons of length >300 bp produced on Illumina platforms . Am I missing something ? Or how this potential issue is addressed in LotuS2 ?

Best, BalaVeer

hildebra commented 1 year ago

Hey BalaVeer, Often enough read1 variation is enough to classify the most obvious species level variation. Going below species level with 16S data is a bit of a .. let's say controversial concept. Read2 has more errors than read1 (quite a few more). This leads to inflated alpha diversity, hence it's safer not to include this. You might indeed catch additional bio variation on read2, but the chances of "catching" a false positive is higher according to simulations. The question is hence: do you prefer to get potentially additional variation on read2, at the cost of getting more false positives? If you're interested in alpha diversity, you don't want the first case.

hth, Falk

Balaveer commented 1 year ago

Thank you very much for your response and opinion @hildebra . They are helpful in understanding how LotuS2 works and the logic behind! May be, I'll get back to you in future, if I come up with any questions in using LotuS2.