qiime2 / q2-dada2

QIIME 2 plugin wrapping DADA2
BSD 3-Clause "New" or "Revised" License
19 stars 36 forks source link

Update for dada2 package version 1.10 #113

Closed benjjneb closed 5 years ago

benjjneb commented 5 years ago

This new branch updates the R scripts to work with version 1.10 of the dada2 R package. However, these scripts will break when using the dada2 R package version 1.6 as currently used by Q2.

This will need to be coordinated with recipe upgrades to move to the now available bioconda build of dada2-1.10.

This will close #85

The bioconda build of dada2-1.10 has migrated to the GCC7 compiler, which has resulted in major performance improvements. In my testing there is no longer any significant speed penalty with a native installation of dada2, and Q2 users of DADA2 will see 2x-20x speedups just due to this upgrade.


This is working on my end within a local conda environment in which dada2 1.10 was installed as so:

conda config --add channels conda-forge
conda config --add channels bioconda
conda config --add channels defaults
conda create -n dada2-1.10 python=3 anaconda
source activate dada2-1.10
conda install -c bioconda bioconductor-dada2=1.10.0

However there are ongoing reports of segfaults for some users that appear to be specific to bioconda installs of dada2-1.10 to be aware of: https://github.com/benjjneb/dada2/issues/684 and https://github.com/qiime2/q2-dada2/issues/85

thermokarst commented 5 years ago

Woohoo! I am happy to review this once you give the go-ahead!

thermokarst commented 5 years ago

@benjjneb, how do you feel about trying to get this into the upcoming 2019.4 release? We are available on this end to help out as necessary. Seems like we might be out of the woods on the segfault issue :crossed_fingers: ...

benjjneb commented 5 years ago

@benjjneb, how do you feel about trying to get this into the upcoming 2019.4 release?

Yes, but I see that the deadline is very soon...

Submit PRs by: 07/22 Merge PRs by: 07/26

I had a more ambitious goal for this release to also add several new features requested in the issues (e.g. pooling options that would allow singleton detection). However that was paused after the segafault issue came up, and given the short time, I think just moving to 1.10 and putting off the functional additions to the next release probably makes sense. That is still a big and very noticeable speedup (2-20x), and the SSE2+seqlen>260 issue that came up before makes me leery of rushing new functionality to the plugin which is a bit harder to make workaround for then the R package.

One important issue, there have been some relatively minor, but non-zero, functionality changes between 1.6 and 1.10 in default dada2 operation. Thus, output will not be identical, and some of the previous tests will fail. How to handle that in this compressed timeframe?

On my end, I need to update the denoise-paired code ASAP to work with 1.10. I can guarantee by end of day Saturday, but hopefully tomorrow. Then I think its important that the code runs all the tests without error at least, and that it gets a review from you guys. Does that seem reasonable?

Thanks so much for your work on the segfault. I don't know if I ever would have solved that one, I still only partially follow what happened!

thermokarst commented 5 years ago

Thanks for the update @benjjneb.

Yes, but I see that the deadline is very soon...

We just updated the release schedule, how does merge by 04/26 sound?

I think just moving to 1.10 and putting off the functional additions to the next release probably makes sense

Ditto, let's shoot for that. Let us know what we can do.

How to handle that in this compressed timeframe?

If you update the scripts, @ebolyen or myself can update the tests. Would that work?

On my end, I need to update the denoise-paired code ASAP to work with 1.10. I can guarantee by end of day Saturday, but hopefully tomorrow. Then I think its important that the code runs all the tests without error at least, and that it gets a review from you guys. Does that seem reasonable?

:+1:

Thanks so much for your work on the segfault. I don't know if I ever would have solved that one, I still only partially follow what happened!

Thanks for letting us dig in, that was a fun one!

benjjneb commented 5 years ago

I think the R scripts are good to go, but as per previously I'm running them with by-hand Rscript invocations rather than through the plugin infrastructure itself. Once version 1.10 of the package from bioconda is specified in the plugin then all the tests should complete, but since some output has changed they won't all pass.

I don't think any python-side changes are needed outside the tests, as this is just a straight upgrade to use the 1.10 scripts. In my testing there are major speedups on larger datasets (i.e. datasets that take >10 minutes to run), but curious to hear that confirmed.

ebolyen commented 5 years ago

Awesome! Would it be ok if we push the test-changes to your branch/this-PR?

benjjneb commented 5 years ago

Would it be ok if we push the test-changes to your branch/this-PR?

Sure. Do I need to do anything to allow that?

ebolyen commented 5 years ago

Nope, it's enabled by default on our repos. Just didn't want to surprise you with mystery commits/fast-forward merges.

thermokarst commented 5 years ago

Okay, tests are all up to date. Travis will continue to fail due to having the wrong version of DADA2 in the the environment. I will review the functional changes next. @benjjneb, do the changes to the test data seem reasonable? Seems like there are slightly fewer counts in the tables, but otherwise it looks like just a bunch of minor shifting and shuffling to me.

benjjneb commented 5 years ago

Seems like there are slightly fewer counts in the tables, but otherwise it looks like just a bunch of minor shifting and shuffling to me.

That's exactly what is expected. The biggest difference affecting output is that not every read is denoised by default anymore, hence slightly fewer counts.