scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Investigate Dask for speeding up Zheng17 #921

Open tomwhite opened 5 years ago

tomwhite commented 5 years ago

Can we use Dask to speed up the preprocessing phase of Scanpy by taking advantage of multiple CPUs (or GPUs)?

TLDR: Dask can speed up Zheng17, but it needs lots of cores and a rewritten implementation. CuPy (for GPUs) has missing operations required by Zheng17, so more work is needed for Dask with GPUs.

Investigation

Dask is mainly used with dense arrays; however, the arrays in Scanpy are sparse (for most of the preprocessing phase, at least). I tried looking at pydata/sparse with Dask, but it ran a lot slower than regular scipy.sparse (which is what Scanpy uses).
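For reference, the pydata/sparse route looks roughly like this (a minimal sketch for illustration, not the exact benchmark code):

```python
import dask.array as da
import scipy.sparse
import sparse  # pydata/sparse

# convert a scipy.sparse matrix to a pydata/sparse COO array and use it as
# the backing store of a Dask array; the chunks are then sparse.COO objects
X = scipy.sparse.random(10_000, 2_000, density=0.01, format="csr", dtype="float32")
dX = da.from_array(sparse.COO.from_scipy_sparse(X), chunks=(2_500, 2_000))

# operations run chunk-wise on the sparse chunks, in parallel
gene_counts = dX.sum(axis=0).compute()
```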

So I wrote a wrapper around scipy.sparse to implement NumPy's __array_function__ protocol. This allows sparse arrays to be chunks in a Dask array. This approach seemed promising, with basic operations able to take advantage of multiple cores and run faster than regular scipy.sparse.
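A heavily simplified sketch of the idea (not the actual wrapper in the branch linked under "Code" below; the `SparseWrapper` name is just for illustration):

```python
import numpy as np
import scipy.sparse
import dask.array as da


class SparseWrapper:
    """Illustrative wrapper exposing just enough of the ndarray API for Dask."""

    def __init__(self, value):
        self.value = value  # an underlying scipy.sparse matrix

    # attributes Dask inspects on chunk objects
    @property
    def shape(self):
        return self.value.shape

    @property
    def ndim(self):
        return self.value.ndim

    @property
    def dtype(self):
        return self.value.dtype

    def __getitem__(self, item):
        return SparseWrapper(self.value[item])

    def __array_function__(self, func, types, args, kwargs):
        # unwrap arguments, dispatch to scipy.sparse, re-wrap sparse results
        unwrapped = tuple(a.value if isinstance(a, SparseWrapper) else a for a in args)
        result = func(*unwrapped, **kwargs)
        if scipy.sparse.issparse(result):
            return SparseWrapper(result)
        if isinstance(result, np.matrix):
            return np.asarray(result)
        return result


# the wrapped matrix can then back a Dask array, with sparse chunks
X = scipy.sparse.random(10_000, 2_000, density=0.01, format="csr", dtype=np.float32)
dX = da.from_array(SparseWrapper(X), chunks=(2_500, 2_000), asarray=False)
```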

However, when I first tried running the whole Zheng17 recipe, scipy.sparse was always faster than Dask with scipy.sparse, even with many cores (e.g. 64).

It turned out that when going through AnnData, Dask has to materialize intermediate data more often than necessary in order to populate the AnnData metadata. This is because the way AnnData works means that its metadata must be computed eagerly after each operation in the Zheng17 recipe, rather than lazily for the whole computation (which is the way Dask works).

To avoid this complication, I rewrote the Zheng17 recipe to do all the array computations first and then construct an AnnData representation at the end, taking advantage of Dask's deferred processing of lazy values. (See https://github.com/tomwhite/scanpy/blob/sparse-dask/scanpy/preprocessing/_dask_optimized.py#L115 for the code.)
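In outline, the restructuring looks like this (a rough sketch using dense dask arrays for brevity, not the code in the linked file):

```python
import dask.array as da
import numpy as np
import anndata


def zheng17_like(X, min_counts=1):
    # every step here only extends the Dask task graph; nothing is computed yet
    counts_per_gene = X.sum(axis=0)
    gene_subset = counts_per_gene >= min_counts      # filter_genes-style mask
    X = X[:, gene_subset]
    X = X / X.sum(axis=1, keepdims=True) * 1e4       # normalize_per_cell-style scaling
    # ... dispersion-based gene selection, log1p and scale would follow ...
    return X


X = da.random.random((10_000, 2_000), chunks=(2_500, 2_000))
result = zheng17_like(X)

# a single eager step at the end: materialize and wrap the result in AnnData
adata = anndata.AnnData(np.asarray(result))
```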

With this change, running on the 1M neurons dataset with 64 cores, scipy.sparse takes 334s, while Dask with scipy.sparse takes 138s, a 2.4x speedup.

That's a significant speedup, but I'm not sure that it justifies the code overhead. I'd be interested to hear what others think.

Other notes

Code

See this branch: https://github.com/theislab/scanpy/compare/master...tomwhite:sparse-dask

CuPy and GPUs

I also wrote a wrapper around the GPU equivalent of scipy.sparse, cupyx.scipy.sparse.

Many operations work; however, cupyx.scipy.sparse has a number of missing features that mean it can't be used for Zheng17 yet. It would require significant work in CuPy to get it working:

NumPy 1.16 vs NumPy 1.17

I used NumPy 1.16 for the above experiments. However, when I tried NumPy 1.17 the Dask implementation slowed down significantly. I haven't been able to pinpoint the issue.

ivirshup commented 4 years ago

This is something I'd very much be interested in. A few questions:

And a few questions about sparse matrices on the GPU:

So I wrote a wrapper around scipy.sparse to implement NumPy's __array_function__ protocol. This allows sparse arrays to be chunks in a Dask array.

👍. Any chance you've taken a look at implementing gufuncs? Oops, missed the __array_ufunc__ definition.

tomwhite commented 4 years ago

I'd really like to have scanpy and anndata work better with dask, but am wary of a high code overhead. Could you provide examples of where you were running into issues with arrays being materialized?

You can see where the materialization occurs by looking for references to materialize_as_ndarray in the existing code. For example, in filter_genes: https://github.com/theislab/scanpy/blob/master/scanpy/preprocessing/_simple.py#L215, where the gene subset is materialized as an ndarray and then used to subset the AnnData object.

Contrast this with the optimized version, where the materialization step is not needed and the data remains a dask array throughout the filter_genes method: https://github.com/tomwhite/scanpy/blob/sparse-dask/scanpy/preprocessing/_dask_optimized.py#L18
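As a minimal, self-contained illustration of the difference (simplified, not the code at either link):

```python
import numpy as np
import dask.array as da

X = da.random.random((1_000, 500), chunks=(250, 500))
min_counts = 1

# eager, as in the current filter_genes: pulling the mask into memory here
# (which is what materialize_as_ndarray effectively does) computes part of
# the graph so the result can populate AnnData metadata and subset the object
gene_subset_eager = np.asarray(X.sum(axis=0) >= min_counts)

# lazy, as in the rewritten recipe: the mask and the subset stay dask arrays
# and nothing is computed until the very end of the pipeline
gene_subset_lazy = X.sum(axis=0) >= min_counts
X_filtered = X[:, gene_subset_lazy]
```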

I think this can be worked around on the AnnData side in many cases.

That would be great.

Any chance you did any profiling of these runs? I'd be interested in seeing the performance impact across the pipeline.

The closest I got to this was using the Dask web UI to watch tasks being run (see this part of the benchmark script: https://github.com/tomwhite/scanpy/blob/sparse-dask/benchmark.py#L54-L55). This is useful to see what operations are bottlenecks. The only timings I did were to run the complete recipe.

On the GPU questions, these all sound like promising avenues, but I haven't looked into any of them.

tomwhite commented 4 years ago

Hi @mrocklin, you might be interested in this work, especially the Dask-compatible wrapper around scipy.sparse.

mrocklin commented 4 years ago

Yes. I'm interested in many of the things here. Thank you for pinging me. I'm happy to engage going forward in a variety of ways. Let's start with a few responses.

I tried looking at pydata sparse with Dask, but it ran a lot slower than regular scipy.sparse (which is what Scanpy uses).

It would be great to get a slimmed down version of the operations that you're running with pydata/sparse and submit those to the issue tracker there. @hameerabbasi is usually pretty responsive, and I know that he appreciates learning about new use cases of pydata/sparse.

So I wrote a wrapper around scipy.sparse to implement NumPy's __array_function__ protocol. This allows sparse arrays to be chunks in a Dask array. This approach seemed promising, with basic operations able to take advantage of multiple cores and run faster than regular scipy.sparse.

Thoughts on adding this to scipy.sparse itself so that we can avoid the wrapper? cc @rgommers

It turned out that when going through AnnData, Dask has to materialize intermediate data more often than necessary in order to populate the AnnData metadata. This is because the way AnnData works means that its metadata must be computed eagerly after each operation in the Zheng17 recipe, rather than lazily for the whole computation (which is the way Dask works).

Another option would be to see if you can swap out Anndata for Xarray. This is a big change obviously, and probably pretty disruptive to the existing codebase, but it would align you with many other software projects and scientific communities that are currently thinking about these exact same problems. My guess is that in the long run it would save you time, assuming that Xarray DataArrays meet your needs semantically.
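To make the suggestion concrete, a dense, dask-backed stand-in for an AnnData-like container might look like this in Xarray (purely a sketch; sparse support is exactly the sticking point discussed further down):

```python
import dask.array as da
import xarray as xr

ds = xr.Dataset(
    {"X": (("obs", "var"), da.random.random((100, 50), chunks=(50, 50)))},
    coords={
        "obs": [f"cell_{i}" for i in range(100)],
        "var": [f"gene_{i}" for i in range(50)],
    },
)

# per-gene annotations become variables along the "var" dimension,
# roughly analogous to columns of adata.var, and stay lazy
ds["n_counts"] = ds["X"].sum(dim="obs")
```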

Many operations work; however, cupyx.scipy.sparse has a number of missing features that mean it can't be used for Zheng17 yet. It would require significant work in CuPy to get it working:

I could imagine that these might be in scope for NVIDIA folks to work on in a few months (no promises though). If you wanted to raise these as issues there to track things, that would be helpful. cc @jakirkham @pentschev

However, when I tried NumPy 1.17 the Dask implementation slowed down significantly. I haven't been able to pinpoint the issue

I would be curious to know what's going on here if you find out.

Any chance you did any profiling of these runs? I'd be interested in seeing the performance impact across the pipeline.

The closest I got to this was using the Dask web UI to watch tasks being run (see this part of the benchmark script: https://github.com/tomwhite/scanpy/blob/sparse-dask/benchmark.py#L54-L55). This is useful to see what operations are bottlenecks. The only timings I did were to run the complete recipe.

+1 on profiling. I suggest that you first start with compute(scheduler="single-threaded") and the cProfile module. This will avoid any parallelism, and hopefully let you use profiling techniques that are more familiar to you. I personally like snakeviz.
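For instance, something like this (with a toy computation standing in for the recipe, and an arbitrary output filename):

```python
import cProfile
import dask.array as da

# toy stand-in for the lazy Zheng17 pipeline output
result = da.random.random((10_000, 1_000), chunks=(1_000, 1_000)).sum()

# run everything in one thread so the profile is easy to read,
# then inspect the dump with e.g. `snakeviz zheng17.prof`
cProfile.run('result.compute(scheduler="single-threaded")', "zheng17.prof")
```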

If you want to get on a screenshare some time I'm happy to look at dashboard plots with you.

rgommers commented 4 years ago

So I wrote a wrapper around scipy.sparse to implement NumPy's __array_function__ protocol. This allows sparse arrays to be chunks in a Dask array. This approach seemed promising, with basic operations able to take advantage of multiple cores and run faster than regular scipy.sparse.

Thoughts on adding this to scipy.sparse itself so that we can avoid the wrapper? cc @rgommers

See discussion in https://github.com/scipy/scipy/issues/10362 for this. The general sentiment is: probably not the best idea, because of the matrix/ndarray semantics not being compatible. More input on that SciPy issue is very welcome.

rgommers commented 4 years ago

The scipy.sparse wrapper is actually interesting. I think it's tricky to add directly to SciPy, but it could be split out as a separate package that users could use and we could link to in the scipy.sparse docs. With some context about things like @ vs *, users can then make an informed decision to use it.
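For context, the @ vs * difference in a nutshell (plain NumPy/SciPy behavior, nothing wrapper-specific):

```python
import numpy as np
import scipy.sparse

A = np.array([[1, 2], [3, 4]])
S = scipy.sparse.csr_matrix(A)

A * A               # ndarray semantics: elementwise -> [[1, 4], [9, 16]]
(S * S).toarray()   # np.matrix-style semantics: matrix product -> [[7, 10], [15, 22]]
A @ A               # matrix product for ndarrays -> [[7, 10], [15, 22]]
S @ S               # also the matrix product for sparse matrices
```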

tomwhite commented 4 years ago

Thank you for the detailed response @mrocklin!

It would be great to get a slimmed down version of the operations that you're running with pydata/sparse and submit those to the issue tracker there.

I will try to produce a test case and post it there.

Another option would be to see if you can swap out Anndata for Xarray.

This has been discussed before (https://github.com/theislab/anndata/issues/32) but the sticking point was sparse support. Perhaps with some of the techniques being discussed in this issue it might become an option again, with all the benefits you outlined.

I could imagine that these might be in scope for NVidia folks to work on in a few months (no promises though). If you wanted to raise these as issues there to track things that would be helpful.

Thanks - I've opened issues for these features on the CuPy issue tracker.

tomwhite commented 4 years ago

The scipy.sparse wrapper is actually interesting. I think it's tricky to add directly to SciPy, but it could be split out as a separate package that users could use and we could link to in the scipy.sparse docs.

Any thoughts on where the scipy.sparse wrapper might live? It currently exposes just enough of the NumPy API for the purposes of Scanpy - i.e. it would need quite a bit more work to be more generally useful. Ditto for the cupyx.scipy.sparse wrapper.

jakirkham commented 4 years ago

Many operations work; however, cupyx.scipy.sparse has a number of missing features that mean it can't be used for Zheng17 yet. It would require significant work in CuPy to get it working:

I could imagine that these might be in scope for NVIDIA folks to work on in a few months (no promises though). If you wanted to raise these as issues there to track things, that would be helpful. cc @jakirkham @pentschev

FWIW, made sure to cc us on the issues that you opened. Though please cc us on any others as well.

I should add that I think the CuPy devs will want to keep their sparse implementation in pretty close alignment with SciPy's. So I don't think CuPy's sparse will solve any issues that SciPy's sparse does not also solve. However, things that CuPy does not implement but SciPy does implement are likely in scope. Though no idea where these sit in the priority queue ATM.

mrocklin commented 4 years ago

NVIDIA folks also recently ran into Dask Array + CuPy sparse issues. People here may be interested in tracking https://github.com/dask/dask/issues/5604

ivirshup commented 4 years ago

Another option would be to see if you can swap out Anndata for Xarray. This is a big change obviously, and probably pretty disruptive to the existing codebase, but it would align you with many other software projects and scientific communities that are currently thinking about these exact same problems. My guess is that in the long run it would save you time, assuming that Xarray DataArrays meet your needs semantically.

I've been wanting to use Xarray in the backend for AnnData, as AnnData objects are like a restricted Dataset. This is mainly blocked by not having CSC/CSR sparse arrays compatible with Xarray, since we use those formats pretty heavily.

@tomwhite's sparse wrapper could be a solution to this, as xarray will accept such arrays if an __array_function__ implementation is added. I tried a simple implementation (broken in many cases) which had promising results inside DataArrays. I'd definitely like to help fill this out a bit more.

__array_function__ implementation:

```python
def __array_function__(self, func, types, args, kwargs):
    result = func(
        *(x.value if isinstance(x, SparseArray) else x for x in args),
        **kwargs,
    )
    if issparse(result):
        result = SparseArray(result)
    elif isinstance(result, np.matrix):
        result = np.asarray(result)
    return result
```

@mrocklin would it make sense for this SparseArray class to live in pydata/sparse as a pair of CSR/CSC classes? The internals could gradually be replaced with a more generic n-dimensional representation, but this would get two very common use cases into the library.

mrocklin commented 4 years ago

@mrocklin would it make sense for this SparseArray class to live in pydata/sparse as a pair of CSR/CSC classes? The internals could gradually be replaced with a more generic n-dimensional representation, but this would get two very common use cases into the library.

That's a question for @hameerabbasi , who leads pydata/sparse development. I think that he may also have other thoughts on supporting CSR/CSC classes. I think that he and others have been working on a general solution to this for a little while now.

rgommers commented 4 years ago

That general solution for CSR/CSC is the GXCS format, see https://github.com/pydata/sparse/issues/125. Now that SciPy 1.4.0 added support for using scipy.sparse.linalg with pydata/sparse arrays, and we have that GXCS format getting closer to being ready for general use, I think that's the main way forward.