Closed benjjneb closed 6 years ago
Ouch! Thanks for investigating @benjjneb, let us know what we can do to help.
Did some profiling of execution time of the core denoising algorithm on a test dataset for both the conda-installed version of dada2 and the natively installed version of dada2 (both 1.4.0):
Q2 env: (time command) user: 25m37.428s b_compare: 21.80min -sub_new: 20.96min —kmer_dist: 11.57min —nwalign_vectorized2: 8.64min —-dploop_vec: 4.91m —Al2subs: 24.39s b_p_update: 1.23m b_shuffle2: 36.94s b_bud: 35.51s
Native: (time command) user: 3m15.447s b_compare: 1.77min -sub_new: 1.51min —kmer_dist: <50s —nwalign_vectorized2: 50.21s —-dploop_vec: <50s —Al2subs: 5.20s b_p_update: 32.81s b_shuffle2: 9.51s b_bud: 16.58s
This indicates that the slowdown is even worse than 5x (could approach 20x) in the compiled parts of the code base critical to performance. Datasets with larger samples will spend more time in that part of the code, and approach a 10-20x slowdown.
The biggest slowdowns are in the bits that need to be auto-vectorized, supporting the idea that this is an issue with the compiled code being generated by the bioconda build.
Next step... how to get bioconda to appropriately compile the package!
Right now the feedback from the bioconda issue is not terribly promising. The problems seem to be that conda is reliant on an old version of gcc (4.8.5) that probably does not auto-vectorize well (at all?), and secondly that binaries are being constructed for the lowest-common denominator architecture -- so no SSE vectorization.
Something we could potentially do is create our own conda builds of dada2
where we would by default install a "modern" architecture (compiled appropriately), but we could have a different package that users could install otherwise.
This is kind of what continuum does with numpy/scipy, they have an nomkl
metapackage that installs a different set of packages if you don't have architecture support.
Update: I've branched a version of the package with one critical piece rewritten with SSE intrinsics (with the hope that will fix the very-old-compiler problem) and am testing a new conda recipe that uses that branch. The Q2 R scripts can also be modified to forgo our vectorized aligner and instead use the non-vectorized aligner (faster when there is no vectorization).
We'll see what happens, but in theory those changes could push the performance degradation from (as high as) 20x to more like 3x.
I have a working local conda build based on the explicit-SEE branch of the dada2 package: https://github.com/benjjneb/dada2/tree/explicit-SSE
On my machine this reduces the performance penalty of conda-installed dada2 vs. native dada2 from 10x to 3x on the dataset I am testing on, so a big speedup in practice.
What do people think is the best way to proceed in terms of getting this working for Q2 itself? I am leaning towards creating a new bioconda recipe (e.g. dada2-for-qiime2) that is built off this SSE feature branch combined with an update to the q2-dada2 R scripts. That is probably the fastest way to roll these improvements out.
Medium term, the explicit SSE branch will hopefully make it into the next official Bioconductor release (1.6) and once that makes it onto bioconda the special qiime2 recipe could be retired.
Long term, it is hard to see the bioconda build ever getting that close to performance parity with a native installation. Not sure what to do about that.
@benjjneb thanks so much for looking into this!
On my machine this reduces the performance penalty of conda-installed dada2 vs. native dada2 from 10x to 3x on the dataset I am testing on, so a big speedup in practice.
That is amazing!
Short-term: I agree with @epruesse on the sister-thread, it may not be worth the overhead (though we are happy to assist), especially if you will have explicit SSE in the official Bioconductor release (medium term).
Long term: I suspect, but can't prove at the moment, that it would still be possible to compile dada2 within a QIIME 2 environment if you needed to. The issue at present is that Bioconductor's R is ahead of bioconda's R (so there isn't a way to use biocLite
to fetch and compile it), but that seems like a transient issue. Worst case, we could ship an alternative build of q2-dada2
that didn't have a dependency on bioconductor-dada2
which meant you could self-compile (and would in fact need to).
This issue came up on the forum, leaving this here as a backref.
TO follow-up on the current status (after I got a bit side-tracked by the arrival of our second kid) the plan now is to merge the already-working explicit SSE implementation into the main dada2 branch as part of our upcoming 1.6 release (end of October) and then build the new conda-recipe off of that release version.e
I'd love to push something out ASAP, but I agree with some of the feedback (including above) that maintaining a Q2-specific branch is not a good idea in the long-term.
Explicit SSE has now been merged into the main branch, and is passing all checks/tests in the Bioconductor build system. I.e. SSE will make it into the upcoming 1.6 release!
Once we freeze for 1.6 I'll start working on a new bioconda build.
I am trying to get a new bioconda dada2 build going that uses the new 1.6 release, and am failing. The issue is that the build system in the current bioconda-recipes has moved to R 3.4.1, but there arent R 3.4.1 bioconda builds yet for a couple of dada2's dependences (esp. ShortRead and some of the parallelization packages).
This may not be the right place to ask, but I saw @ebolyen thread on seemingly related problems: https://github.com/bioconda/bioconda-recipes/issues/6341
Any advice? Presumably the bioconda folks will get the entire new Bioconductor release building eventually, so just waiting is an option. But I was really hoping to get this working sooner so the Q2-specific performance issues could be addressed.
I'm afraid I don't know too much either. Do you have a list of the dependencies that aren't recompiled yet? Maybe there is something we can do to help with that.
Otherwise, I'm sure the next bioconductor releases will be out at some point, so I don't think we're too worried about it. Hopefully it happens before the end of the year though.
You could open a PR on bioconda for the package, even if it's not yet working. Put a [WIP]
in the title. We can then see which packages need to be updated and get those going.
Thanks @epruesse I opened a PR: https://github.com/bioconda/bioconda-recipes/pull/6787
It looks like bioconductor-dada2
is up on bioconda with version 1.6.0
compiled against R
3.4.1
!!
Awesome! Although... will it be a problem to use this build in Q2 because of the different R version?
Once this goes in it should be straightforward to add the explicit SSE flag to the R scripts (its off by default in the R package) and fingers crossed get a big chunk of that missing speed back.
Although... will it be a problem to use this build in Q2 because of the different R version?
Nope! We'll just start distributing the new R instead of the old one.
Some improvement in #78
Depending on dataset, speedups from 1.5x-4x vs. previous bioconda version/Q2 scripts in my fairly limited testing.
However, bioconda installed dada2 is still substantially slower than native installs (2x-10?x), and I'm not sure that will ever be fully rectified without bioconda updating to more modern compiler versions.
Fixed in #78. Closing this issue --- if we decide to look into additional performance gains we can reopen or create a new issue.
When I run the q2-dada2 workflow (via Rscript, identical input data) on my machine, the processing time is 5x longer when run inside the q2 conda environment than when run in the standard shell (the native installation) using the same 1.4.0 version of the package.
More to come, but my initial suspicions is that when bioconda constructs the universal binary version of the package that no compiler optimizations are being performed. This could cause such a slowdown as the two performance critical parts of the algorithm are written to be auto-vectorized by
gcc
andllvm
rather than being explicitly vectorized.