expose minOverlap parameter

nbokulich commented 5 years ago

Improvement Description exposing minOverlap would allow adjustment of this parameter, which has been a bit of a bugbear for some users.

Current Behavior minOverlap is hardcoded at 20

Proposed Behavior Expose minOverlap but leave the default=20

References forum xref

benjjneb commented 5 years ago

Current Behavior minOverlap is hardcoded at 20

An important note, as part of the update to 1.10 or the R package, the minOverlap default value change from 20 to 12, so that is now the "hardcoded" value.

Maybe it is still worth exposing this, but the maximum benefit here is very low, as one should never go below minOverlap=4 as this opens up to a lot of FP mergers.

nbokulich commented 5 years ago

An important note, as part of the update to 1.10 or the R package, the minOverlap default value change from 20 to 12, so that is now the "hardcoded" value.

thanks @benjjneb ! I was not aware of that change.

Maybe it is still worth exposing this, but the maximum benefit here is very low

I agree, probably not a priority but we have had a number of users ask about this so it may be worthwhile to give control.

one should never go below minOverlap=4 as this opens up to a lot of FP mergers.

we can set 4 as the minimum overlap; an error will be raised if users try to go lower. (an explanation can be given in the description)

benjjneb commented 5 years ago

If it's sufficiently requested then OK.

Would it be possible to create a milestone linked to the next Q2 release? There are .couple other things I want to add to the plugin for the next release now that 1.10 is in (especially pooling/singletons) and would help a lot to have them organized by a milestone with reference to a (tentative) date for the next Q2 release.

nbokulich commented 5 years ago

I have added to the 2019.7 release project page — we have been using projects to organize release goals (and release dates and details are available on that page). We have not been using the milestones feature, but you are welcome to use that feature if it helps you organize issues for q2-dada2. Thanks!

colinbrislawn commented 5 years ago

I'm gamely interested in working on this issue.

EDIT: The maxMismatch should be included in this PR as well. These settings are related and powerful!

Oddant1 commented 5 years ago

Exposing the min_overlap parameter and defaulting it to 12 (which does appear to be the default value in dada2 after a cursory look) causes the tests to fail, and I was unable to find a value of min_overlap that did pass the tests. The table we get as a result of the command with min_overlap always has more nonzero elements than expected.

emford commented 4 years ago

Ill check into this and see whats going on with the tests.

fconstancias commented 4 years ago

Adding a few more options to the q2-dada2 plugin - R script which is running when calling the q2-dada2 plugin - can be easy and lead to a fine-tuned /state-of-the art dada2 pipeline whitin qiime2.

In addition to the minOverlap, I think that adding truncQ and minlen parameters can help on specific datasets. As @benjjneb mentionned, adding a parameter to specify the pooling strategy : "pseudo", TRUE instead of default FALSE will allow detection of singletons. Having a closer look at the R script running behind q2-dada2 plugin, I also find it more logical to add the option randomize = TRUE as default in the learnErrors function. My understanding is that doing so, samples are randomly added until enough bases/reads are loaded in order to learn the error rates of the dataset instead of adding samples according sample names alphabetic order which can correspond to specific samples of the entire dataset. Is this correct @benjjneb ?

Last one, adding the possibility to export plots from the plotError function could also help people to confirm that everything went well and that error model fits the data. Also, exporting read/ASV length distribution after read merging (as shown here) can also provide a positive feeling that everything went well.

I can help to add those options if you agree they are helpfull.

benjjneb commented 4 years ago

'truncQ' is almost always a superfluous parameter in my experience.

randomize=TRUE has the downside of giving non-identical results when the pipeline is re-run, so for total reproducibility randomize=FALSE is the current default in the R package.

I hope to have a pull request to add pseudo-pooling up today, sadly probably too late for the imminent release though.

The read length stats and plotErrors output definitely can be useful diagnostics in some datasets.

emford commented 4 years ago

So I started on this but my MacAir wont run the tests due to not enough RAM ...:coffin:

qiime2 / q2-dada2

expose minOverlap parameter #117