End to end workflows on GPU

cjnolet commented 4 years ago

It would be very useful for the GPU data science and research community if Scanpy were able to perform end-to-end workflows on the GPU, using either CuPy, cuDF, or both.

An initial iteration of this feature could include simply swapping out the numpy imports for cupy.

[X] Additional function parameters / changed functionality / changed defaults?
[ ] New analysis tool: A simple analysis tool you have been using and are missing in sc.tools?
[ ] New plotting function: A kind of plot you would like to see in sc.pl?
[ ] External tools: Do you know an existing package that should go into sc.external.*?
[ ] Other?

...

quasiben commented 4 years ago

I recall looking at some Scanpy workflows and @davidsebfischer pointed out that many of them rely on sparse data types. Is that still the case?

davidsebfischer commented 4 years ago

> I recall looking at some Scanpy workflows and @davidsebfischer pointed out that many of them rely on sparse data types. Is that still the case?

Yes, this is still the case.

cjnolet commented 4 years ago

@quasiben @davidsebfischer I've been working on using RAPIDS/CuPy to implement a Seurat / Scanpy single-cell RNA workflow. Specifically, I've been finding it quite challenging to do with CuPy sparse arrays because of the following two issues:

https://github.com/cupy/cupy/issues/2360 https://github.com/cupy/cupy/issues/3178

Currently, I'm having to convert to scipy.sparse to implement filtering.
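For readers following along, here is a minimal sketch of what such a host-side filtering step might look like. It is not the code used in the workflow above: the function name `filter_cells_and_genes` and the thresholds are hypothetical, and the example runs entirely on the CPU with scipy.sparse. The GPU round trip is only described in comments, on the assumption that the CuPy sparse matrix exposes `.get()` to copy to the host and that its constructor accepts a scipy.sparse matrix.

```python
import numpy as np
import scipy.sparse as sp


def filter_cells_and_genes(counts, min_genes=2, min_cells=2):
    """Hypothetical helper: drop sparse rows (cells) and columns (genes).

    Rows are cells and columns are genes. A cell is kept if it has at least
    `min_genes` stored nonzero entries; a gene is kept if it is detected in
    at least `min_cells` of the remaining cells.
    """
    counts = sp.csr_matrix(counts)

    # Number of stored entries per row, i.e. genes detected per cell.
    genes_per_cell = counts.getnnz(axis=1)
    keep_cells = np.flatnonzero(genes_per_cell >= min_genes)
    counts = counts[keep_cells, :]

    # Number of stored entries per column among the remaining cells.
    cells_per_gene = counts.getnnz(axis=0)
    keep_genes = np.flatnonzero(cells_per_gene >= min_cells)
    counts = counts[:, keep_genes]

    return counts, keep_cells, keep_genes


# Assumed GPU round trip (not executed here):
#   host = gpu_counts.get()                       # CuPy sparse -> scipy.sparse
#   filtered, _, _ = filter_cells_and_genes(host)
#   gpu_filtered = cupyx.scipy.sparse.csr_matrix(filtered)  # back to device

dense = np.array([
    [1, 0, 2, 0],
    [0, 0, 0, 3],
    [4, 5, 0, 0],
    [0, 6, 7, 0],
])
filtered, kept_cells, kept_genes = filter_cells_and_genes(dense)
print(filtered.toarray())
```

The cost of this approach is the two host/device copies around every filtering step, which is why native slicing support in CuPy matters for an end-to-end GPU workflow.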

quasiben commented 4 years ago

Do you know how hard it would be to add cuSPARSE to CuPy for more sparse support?

cjnolet commented 4 years ago

@quasiben As far as I know, cuSPARSE is already being used under CuPy for a lot of the operations.

I’m not quite sure why those slicing strategies aren’t supported yet. I just figured maybe they were less trivial than the others and weren’t immediately needed so they were pushed off to future feature requests.

I can’t imagine issue #2360 is too hard: I imagine an output array the size of the selection list could be allocated and a CUDA kernel scheduled to write the selected entries in parallel.
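To make that idea concrete, here is a rough sketch in plain Python, assuming the request is about selecting rows of a CSR matrix by a list of row indices. The function name `select_csr_rows` is hypothetical and this is not how CuPy implements it. The two passes mirror what a GPU version might do: a prefix sum over row lengths to size and allocate the output once, then independent per-row copies that a CUDA kernel could run in parallel.

```python
from itertools import accumulate


def select_csr_rows(data, indices, indptr, rows):
    """Hypothetical sketch: gather the given rows of a CSR matrix.

    `data`, `indices`, and `indptr` are the usual CSR arrays; `rows` is the
    selection list. Returns the CSR arrays of the selected rows, in order.
    """
    # Pass 1: length of each selected row, then a running sum gives the
    # new row pointer array and the total output size.
    lengths = [indptr[r + 1] - indptr[r] for r in rows]
    new_indptr = [0] + list(accumulate(lengths))

    # Allocate the output once, sized by the selection.
    total = new_indptr[-1]
    new_data = [0] * total
    new_indices = [0] * total

    # Pass 2: every selected row writes to its own disjoint slice, so the
    # iterations of this outer loop are independent (one thread or warp per
    # row in a GPU kernel).
    for out_row, r in enumerate(rows):
        src = indptr[r]
        dst = new_indptr[out_row]
        for k in range(lengths[out_row]):
            new_data[dst + k] = data[src + k]
            new_indices[dst + k] = indices[src + k]

    return new_data, new_indices, new_indptr


# CSR form of [[1, 0, 2], [0, 0, 3], [4, 5, 0]]
data = [1, 2, 3, 4, 5]
indices = [0, 2, 2, 0, 1]
indptr = [0, 2, 3, 5]
print(select_csr_rows(data, indices, indptr, [2, 0]))
```

Because the write offsets come from the prefix sum, no synchronization between rows is needed during the copy.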

I’m not as sure about the other issue, but what Dask is trying to do seems more like an API compatibility issue than one of performance/compute.

jakirkham commented 4 years ago

What features in cuSPARSE would be useful for slicing?

cjnolet commented 4 years ago

@jakirkham, I’m not sure about slicing. @quasiben’s question seems to imply the use of cuSPARSE in CuPy for general sparse operations. Please correct me if I misunderstood.

I believe that once the two issues above are resolved, much of the scipy.sparse functionality used for preprocessing in Scanpy could be swapped for cupy.sparse.

The ML stuff is a little bit different, and I’ve created a separate issue to track that discussion.

davidsebfischer commented 4 years ago

I think the bottleneck we initially identified with using sparse arrays was this one: https://github.com/cupy/cupy/issues/2359.

The analysis workflows usually have very clear computational bottlenecks, so the translation to GPU should take this into consideration: is it feasible, in terms of available code, to keep the array on the GPU and actually perform all operations there, or will this stay a CPU-centric library that deploys particular steps to the GPU? In batchglm / diffxpy we took the first approach: we build on top of (a CPU-centric Scanpy and) deployed GLM fitting to GPU via TensorFlow 2. We also use estimation code in Dask in the same package that we could in principle use with CuPy; right now this just sits on top of NumPy.

Happy to be involved with this stuff; I spent some time thinking about this with @quasiben already. I think it is really crucial to figure out where it makes sense to invest time in building pipelines that can be executed end-to-end on GPU: because of the large number of tools, this will not be the entire Scanpy tool environment for a long time, so mixed workflows will be necessary.

I would, for example, restrict all efforts to the submodule sc.tl for now, because I think it contains most of the frequently used potential bottlenecks. "End-to-end" doesn't need to go all the way up to analysis graph leaves, such as plotting, in my opinion, as there is little performance gain there.
Nice to have for non-core functionalities would then be some examples of how GPU-based arrays can be used within anndata, so that third parties can modify their tools to operate directly on the GPU array rather than starting to copy arrays. I think this is not really clear for most people right now (I have never done that either), and documenting this properly / improving this would help a lot.

Zethson commented 6 months ago

https://github.com/scverse/rapids_singlecell is the solution! 🚀

scverse / scanpy

End to end workflows on GPU #1177