expose dada parameters band_size and kdist_cutoff

sjanssen2 commented 1 year ago

Summary

@julianselke found that parameters for heuristics in DADA2 are not exposed to the user although the authors recommend to adjust them depending on data characteristics. This PR is to expose both parameters to qiime2 users.

Details

The error model used in the denoising step of DADA2 is based on transition probabilities between aligned nucleotides. Since the computation of these alignments is the most time consuming step, two heuristics are employed: A cutoff for kmer-distances between reads and the bandwidth of the Needleman-Wunsch alignment. The authors state: The default value of this parameter was chosen to minimally impact the alignment of sequences with few indels, such as ribosomal RNA genes. Both heuristics can be disabled by the user, and the default values should be re-examined if the algorithm is applied to genetic regions with significantly different characteristics, such as the indel-rich ITS region. The wrapper denoise.py takes an argument for the band size only, but it is not handed to the call of https://github.com/qiime2/q2-dada2/blob/master/q2_dada2/assets/run_dada.R. Hence, the function dada runs on default parameters for band size and kmer-distance cutoff when called by denoise-paired. For the commands denoise-pyro and denoise-css, however, the band size is set to 32, showing that we are aware of this issue. None of our methods take these parameters from the user.

julianselke commented 1 year ago

Conclusion after testing:

We might have been confused by the statement in the paper:

“Both heuristics can be disabled by the user, and the default values should be re-examined if the algorithm is applied to genetic regions with significantly different characteristics, such as the indel-rich ITS region.”[1]

In the documentation of the R package DADA2, however, the authors state:

“DADA is applied to sequencing technologies with high rates of indels, such as 454 sequencing, the BAND_SIZE parameter should be increased.”[2]

DADA2 aims to treat technical variation (noise/sequencing errors), not biological variation introduced via mutation. Therefore, the parameters should depend on the error characteristics of the sequencing technology, not the marker gene.

It is not necessary to expose these parameters to users working with Illumina sequencing data.

[1] Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJ, Holmes SP. DADA2: High-resolution sample inference from Illumina amplicon data. Nat Methods. 2016 Jul;13(7):581-3. doi: 10.1038/nmeth.3869. Epub 2016 May 23. PMID: 27214047; PMCID: PMC4927377

[2] https://www.bioconductor.org/packages/devel/bioc/manuals/dada2/man/dada2.pdf#Rfn.setDadaOpt.1

benjjneb commented 1 year ago

Just to chime in, our later guidance (as per the R package documentation) is more correct. The appropriateness of these heuristics depends on the nature of the errors introduced by the sequencing tech/PCR, and so (may) be revisited when changing sequencing technologies. Their appropriateness does not depend on the locus under investigation.

IMO, it is unnecessary to expose these parameters via the Q2 plugin. As implemented, there are different actions for different sequencing technologies, with appropriate heuristics set there. And if someone is going above and beyond -- like applying DADA2 to some new sequencing tech -- they probably should be working with the R package directly anyway.

gregcaporaso commented 1 year ago

Thanks for the work on this @sjanssen2 and @julianselke, and thanks for ~chiming~ qiime'ing in @benjjneb!

Based on the feedback from @benjjneb, and after some internal discussion on my team, we agree with @benjjneb that our approach should be to continue to support different technologies through specific actions (e.g., denoise-pyro, denoise-ccs, ...) after parameter settings have been tested with DADA2 directly, so we think it's better to not merge this one. Thank you again, @sjanssen2 and @julianselke, for doing the research, implementation, and testing on this. We're always happy to discuss these types of changes up front.

qiime2 / q2-dada2