conda installed dada2 is slow

benjjneb commented 7 years ago

When I run the q2-dada2 workflow (via Rscript, identical input data) on my machine, the processing time is 5x longer when run inside the q2 conda environment than when run in the standard shell (the native installation) using the same 1.4.0 version of the package.

More to come, but my initial suspicions is that when bioconda constructs the universal binary version of the package that no compiler optimizations are being performed. This could cause such a slowdown as the two performance critical parts of the algorithm are written to be auto-vectorized by gcc and llvm rather than being explicitly vectorized.

jairideout commented 7 years ago

Ouch! Thanks for investigating @benjjneb, let us know what we can do to help.

benjjneb commented 7 years ago

Did some profiling of execution time of the core denoising algorithm on a test dataset for both the conda-installed version of dada2 and the natively installed version of dada2 (both 1.4.0):

Q2 env: (time command) user: 25m37.428s b_compare: 21.80min -sub_new: 20.96min —kmer_dist: 11.57min —nwalign_vectorized2: 8.64min —-dploop_vec: 4.91m —Al2subs: 24.39s b_p_update: 1.23m b_shuffle2: 36.94s b_bud: 35.51s

Native: (time command) user: 3m15.447s b_compare: 1.77min -sub_new: 1.51min —kmer_dist: <50s —nwalign_vectorized2: 50.21s —-dploop_vec: <50s —Al2subs: 5.20s b_p_update: 32.81s b_shuffle2: 9.51s b_bud: 16.58s

This indicates that the slowdown is even worse than 5x (could approach 20x) in the compiled parts of the code base critical to performance. Datasets with larger samples will spend more time in that part of the code, and approach a 10-20x slowdown.

The biggest slowdowns are in the bits that need to be auto-vectorized, supporting the idea that this is an issue with the compiled code being generated by the bioconda build.

Next step... how to get bioconda to appropriately compile the package!

benjjneb commented 7 years ago

Right now the feedback from the bioconda issue is not terribly promising. The problems seem to be that conda is reliant on an old version of gcc (4.8.5) that probably does not auto-vectorize well (at all?), and secondly that binaries are being constructed for the lowest-common denominator architecture -- so no SSE vectorization.

ebolyen commented 7 years ago

Something we could potentially do is create our own conda builds of dada2 where we would by default install a "modern" architecture (compiled appropriately), but we could have a different package that users could install otherwise.

This is kind of what continuum does with numpy/scipy, they have an nomkl metapackage that installs a different set of packages if you don't have architecture support.

benjjneb commented 7 years ago

Update: I've branched a version of the package with one critical piece rewritten with SSE intrinsics (with the hope that will fix the very-old-compiler problem) and am testing a new conda recipe that uses that branch. The Q2 R scripts can also be modified to forgo our vectorized aligner and instead use the non-vectorized aligner (faster when there is no vectorization).

We'll see what happens, but in theory those changes could push the performance degradation from (as high as) 20x to more like 3x.

benjjneb commented 7 years ago

I have a working local conda build based on the explicit-SEE branch of the dada2 package: https://github.com/benjjneb/dada2/tree/explicit-SSE

On my machine this reduces the performance penalty of conda-installed dada2 vs. native dada2 from 10x to 3x on the dataset I am testing on, so a big speedup in practice.

What do people think is the best way to proceed in terms of getting this working for Q2 itself? I am leaning towards creating a new bioconda recipe (e.g. dada2-for-qiime2) that is built off this SSE feature branch combined with an update to the q2-dada2 R scripts. That is probably the fastest way to roll these improvements out.

Medium term, the explicit SSE branch will hopefully make it into the next official Bioconductor release (1.6) and once that makes it onto bioconda the special qiime2 recipe could be retired.

Long term, it is hard to see the bioconda build ever getting that close to performance parity with a native installation. Not sure what to do about that.

ebolyen commented 7 years ago

@benjjneb thanks so much for looking into this!

On my machine this reduces the performance penalty of conda-installed dada2 vs. native dada2 from 10x to 3x on the dataset I am testing on, so a big speedup in practice.

That is amazing!

Short-term: I agree with @epruesse on the sister-thread, it may not be worth the overhead (though we are happy to assist), especially if you will have explicit SSE in the official Bioconductor release (medium term).

Long term: I suspect, but can't prove at the moment, that it would still be possible to compile dada2 within a QIIME 2 environment if you needed to. The issue at present is that Bioconductor's R is ahead of bioconda's R (so there isn't a way to use biocLite to fetch and compile it), but that seems like a transient issue. Worst case, we could ship an alternative build of q2-dada2 that didn't have a dependency on bioconductor-dada2 which meant you could self-compile (and would in fact need to).

thermokarst commented 6 years ago

This issue came up on the forum, leaving this here as a backref.

benjjneb commented 6 years ago

TO follow-up on the current status (after I got a bit side-tracked by the arrival of our second kid) the plan now is to merge the already-working explicit SSE implementation into the main dada2 branch as part of our upcoming 1.6 release (end of October) and then build the new conda-recipe off of that release version.e

I'd love to push something out ASAP, but I agree with some of the feedback (including above) that maintaining a Q2-specific branch is not a good idea in the long-term.

benjjneb commented 6 years ago

Explicit SSE has now been merged into the main branch, and is passing all checks/tests in the Bioconductor build system. I.e. SSE will make it into the upcoming 1.6 release!

Once we freeze for 1.6 I'll start working on a new bioconda build.

benjjneb commented 6 years ago

I am trying to get a new bioconda dada2 build going that uses the new 1.6 release, and am failing. The issue is that the build system in the current bioconda-recipes has moved to R 3.4.1, but there arent R 3.4.1 bioconda builds yet for a couple of dada2's dependences (esp. ShortRead and some of the parallelization packages).

This may not be the right place to ask, but I saw @ebolyen thread on seemingly related problems: https://github.com/bioconda/bioconda-recipes/issues/6341

Any advice? Presumably the bioconda folks will get the entire new Bioconductor release building eventually, so just waiting is an option. But I was really hoping to get this working sooner so the Q2-specific performance issues could be addressed.

ebolyen commented 6 years ago

I'm afraid I don't know too much either. Do you have a list of the dependencies that aren't recompiled yet? Maybe there is something we can do to help with that.

Otherwise, I'm sure the next bioconductor releases will be out at some point, so I don't think we're too worried about it. Hopefully it happens before the end of the year though.

epruesse commented 6 years ago

You could open a PR on bioconda for the package, even if it's not yet working. Put a [WIP] in the title. We can then see which packages need to be updated and get those going.

benjjneb commented 6 years ago

Thanks @epruesse I opened a PR: https://github.com/bioconda/bioconda-recipes/pull/6787

ebolyen commented 6 years ago

It looks like bioconductor-dada2 is up on bioconda with version 1.6.0 compiled against R 3.4.1!!

benjjneb commented 6 years ago

Awesome! Although... will it be a problem to use this build in Q2 because of the different R version?

Once this goes in it should be straightforward to add the explicit SSE flag to the R scripts (its off by default in the R package) and fingers crossed get a big chunk of that missing speed back.

ebolyen commented 6 years ago

Although... will it be a problem to use this build in Q2 because of the different R version?

Nope! We'll just start distributing the new R instead of the old one.

benjjneb commented 6 years ago

Some improvement in #78

Depending on dataset, speedups from 1.5x-4x vs. previous bioconda version/Q2 scripts in my fairly limited testing.

However, bioconda installed dada2 is still substantially slower than native installs (2x-10?x), and I'm not sure that will ever be fully rectified without bioconda updating to more modern compiler versions.

thermokarst commented 6 years ago

Fixed in #78. Closing this issue --- if we decide to look into additional performance gains we can reopen or create a new issue.

qiime2 / q2-dada2

conda installed dada2 is slow #74