qiime2 / q2-dada2

QIIME 2 plugin wrapping DADA2
BSD 3-Clause "New" or "Revised" License
19 stars 36 forks source link

Upgrade to DADA2 >= 1.7.3 #85

Closed thermokarst closed 5 years ago

thermokarst commented 6 years ago

As noted here, when that new release lands, we should update the pinned version here, as well as revert the SSE changes made in January 2018 in the wake of SSE-gate.

thermokarst commented 6 years ago

@benjjneb, can you let us know which bioconductor release cycle you expect this to be available in, that way we can track it in the appropriate QIIME 2 release cycle. Thanks!

benjjneb commented 6 years ago

This won't get to release until the next Bioconductor release (BioC 3.7) which will be around April. Given that there will also be a delay in the BioC release propagating to bioconda, this would make sense for a June/July Q2 release.

benjjneb commented 6 years ago

Still waiting on the May Bioconductor release to propagate to bioconda. I think progress will show up on this issue: https://github.com/bioconda/bioconda-recipes/issues/8947

Once it migrates, I'm planning to update the R scripts, and to address several other outstanding issues at the same time including #87, #93, #97, #99

And note to self: Add an extra step catching and removing very low depth sequences before learning the error rates.

ebolyen commented 6 years ago

@benjjneb, in the new granular pipeline scheme, would it be possible to provide a way to handle sample pooling? This came up recently in the context of breakaway and alpha diversity estimation.

breakaway in particular requires singletons to exist, if there was a way to conditionally return FeatureTable[Frequency % Properties('singletons')] we would be able to enforce that assumption (when sample pooling is true or pseudo at least).

We're going to be trying to address TypeMap and similar typing related things in the next release cycle. If I need to implement a way to dependently type the output, that should be possible (especially with a real use-case now).

benjjneb commented 6 years ago

@benjjneb, in the new granular pipeline scheme, would it be possible to provide a way to handle sample pooling? This came up recently in the context of breakaway and alpha diversity estimation.

Yes. Although given the frustrating speed penalty that bioconda-dada2 has, pseudo-pooling might make more sense.

if there was a way to conditionally return FeatureTable[Frequency % Properties('singletons')] we would be able to enforce that assumption (when sample pooling is true or pseudo at least).

We're going to be trying to address TypeMap and similar typing related things in the next release cycle. If I need to implement a way to dependently type the output, that should be possible (especially with a real use-case now).

Have to admit, I don't really follow this. Basically option to a different return type if pool=TRUE (or pool=psuedo)?

ebolyen commented 6 years ago

Yep! Basically we'd just figure out when that property can be added, and tools that need it can require it as an input. Everything else will continue not caring like normal.

epruesse commented 5 years ago

Bioconda is now at 1.10: http://bioconda.github.io/recipes/bioconductor-dada2/README.html

@benjjneb Is the speed penalty still there?

This blocks qiime2/q2-alignment#64

benjjneb commented 5 years ago

Bioconda is now at 1.10: http://bioconda.github.io/recipes/bioconductor-dada2/README.html

Interesting. It would actually be easier to just jump ahead to 1.10 as the edge case merging bugs were mostly worked out between 1.8 and 1.10.

@benjjneb Is the speed penalty still there?

It will be a little better than now, but as far as I coiuld tell the speed penalty is pretty hard to overcome completely, because it is a result of how bioconda compiles code (old version of gcc, low levels of optimization to handle generic hardware).

thermokarst commented 5 years ago

old version of gcc

I wonder if the recent bump to gcc7 on conda-forge and bioconda will help us out here...

epruesse commented 5 years ago

That's what I meant. If it's not -march dependent, then it should now all be OK.

Having -march=native compiled stuff won't work for Bioconda. We've had the discussion in a few places, but just picture a user with miniconda in $HOME and doing qsub on a cluster with a few generations of compute nodes added over time. Even on Travis you'll sometimes run into different CPUs.

The reason why I can't easily build a SINA version that matches the pinned libs is that we followed Conda-Forge in the move to GCC7. Since that brought the ABI change required to support C++11, all C++ libs except for libstdc++ are incompatible with the old ones.

benjjneb commented 5 years ago

I wonder if the recent bump to gcc7 on conda-forge and bioconda will help us out here...

Didn't realize this had happened! I will definitely re-profile the speed issue with the 1.10 bioconda builds. fingers crossed would be awesome if the performance delta narrows.

epruesse commented 5 years ago

Didn't realize this had happened!

It's still happening I suppose. We are rebuilding more or less on an as-needed-basis because of resource constraints. Conda-Forge had been rebuilding for a while, and this January did "the label switch". That means all the GCC7 package previously in conda-forge/label/gcc7 became available through conda-forge. We tagged our pre-gcc7 state and started rebuilding. So far no major disasters have happened fingers crossed.

benjjneb commented 5 years ago

@epruesse A brief update, we have started getting segfault error reports for the dada2 package that appear to be coming from conda-installed versions of the 1.10 package, i.e. the new package version that is being built with GCC7. See for example: https://github.com/benjjneb/dada2/issues/684

benjjneb commented 5 years ago

I've just run some initial tests on the bioconda version of DADA2 1.10 built with GCC7, and it appears that the speed penalty versus natively installed DADA2 is now essentially gone.

I will do some more detailed testing, but this would be a major quality-of-life improvement for folks using DADA2 via QIIME2 (up to a 10x speedup) and strongly suggests skipping the 1.8 package version and going straight to 1.10. Is that in the realm of possibility for the next Q2 release?

apcamargo commented 5 years ago

@benjjneb Did you solve the segfault error or it didn't happen with your samples?

ebolyen commented 5 years ago

Is that in the realm of possibility for the next Q2 release?

Yep! There's still a full month or so on the next release. The dates are also probably something we can flex if it comes down to it, the speed improvement would certainly be worth it.

benjjneb commented 5 years ago

@benjjneb Did you solve the segfault error or it didn't happen with your samples?

Works fine in my initial testing, but I haven't tried to reproduce the reported bioconda-specific segfault yet. Will try soon.

Yep! There's still a full month or so on the next release. The dates are also probably something we can flex if it comes down to it, the speed improvement would certainly be worth it.

OK, I am going to move forward with testing and updating the Q2 scripts based on package version 1.10 then. I'll need help from your side figuring out the Q2 conda recipes and any python updates to the plugin though.

ebolyen commented 5 years ago

I'll need help from your side figuring out the Q2 conda recipes and any python updates to the plugin though.

Sounds like a plan! Feel free to structure the R scripts into as many or few as you need. I can adapt the Python side to work correctly. I think we had talked about breaking up the denoise-* methods into multiple actions before (and maybe even including taxonomic assignment?). We have pipelines now, so I can re-compose the original denoise-* methods in Python if you think that's worth attempting with this.


Re segfault: Is it possible that it's just a blanket runtime conflict? e.g. glibc on CentOS 5 vs anything compiled in the last decade?

benjjneb commented 5 years ago

Just wanted to ping this thread with the DADA2 issue that is open on segfaults with bioconda-dada2-1.10: https://github.com/benjjneb/dada2/issues/684

In testing on my local machine, the bioconda install of 1.10 seems fine, but clearly there is something going on here because there are multiple people with the same report.

epruesse commented 5 years ago

It might be not just 1.10.

(segfault in 1.8) https://github.com/bioconda/bioconda-recipes/pull/13847 (just references) https://github.com/bioconda/bioconda-recipes/issues/13776

In 1.8, it appears to have been a libc++ issue? I can't tell. We need someone proficient in R for this.

epruesse commented 5 years ago

Ok updates - my best guess is a concurrency bug somewhere in dada2. See the issue at dada2 repo.