@cornhundred, based on your description it sounds like the core of your feature request is that you'd like a SparseDataFrame data structure, with the ability to read from and write to parquet as a sub-feature. I'm going to update this issue's title accordingly. Please let me know if I'm off base.
Additionally, would you be able to share a bit more about your use case? Is it captured in your notebook from the Twitter thread? What would you like to do with your data in cuDF that makes you want to go from a sparse CuPy matrix to a cuDF dataframe?
Hi @beckernick, yes, that rename makes sense - I would like a cuDF sparse data structure that reads from and writes to parquet. This data structure would be similar to the pandas SparseDataFrame (which is slated for deprecation - see issue) or a DataFrame using the SparseArray extension dtype (see same issue).
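For reference, here is roughly what the pandas side of this looks like with the SparseArray extension dtype (a minimal sketch with toy data; the real matrices are far larger):

```python
import numpy as np
import pandas as pd

# Toy stand-in for a >95% sparse expression matrix
dense = pd.DataFrame(np.eye(1000, dtype=np.float64))

# SparseDtype is the extension-dtype replacement for SparseDataFrame
sparse = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

print(sparse.dtypes.iloc[0])  # Sparse[float64, 0.0]
print(sparse.sparse.density)  # ~0.001: only non-fill values are stored
```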
I'm working on some Google Colab notebooks to show off our specific use cases with our sparse datasets and RAPIDS. In the meantime, the NBViewer notebook shows an example workflow without RAPIDS: loading data (typically thousands of samples/cells in ~30,000 dimensions/genes), hierarchical clustering (usually of the top 250-500 most variable dimensions/genes), visualization using the interactive Clustergrammer2 heatmap, loading an external reference signature dataset, and labeling samples/cells by assigning each to the nearest signature.
It would be great to move more of our workflow to RAPIDS (e.g. K-means clustering, UMAP, hierarchical clustering, pairwise distance calculations, etc.). I haven't actually used CuPy yet, but it was suggested since it already supports sparse data structures (however, I do not think it supports generating sparse data structures directly from parquet).
I was able to reproduce the Fashion-MNIST UMAP example from this Medium post on Colab - I'm seeing a 60X speedup. I'm working on adapting this example to work with our data.
Hello guys, is there any ETA or progress on this? The issue I'm facing is that I have a medium-scale user-item matrix that, if converted to dense, cannot fit into host/device memory, whereas cuML's model.fit requires X to be a dense matrix. An impasse?
Similar problem here. Any updates?
Thank you @rlu-aa and @thanasions for your continued interest. Based on the earlier discussion, the sparse matrix representation can be stored in a cuDF dataframe, but it is not interoperable in that format with cuML. Has this issue received discussion in the cuML repo yet?
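As a stopgap, the pieces do compose today if the sparse data is stored as explicit COO triplets in parquet. A rough sketch (the column names and file path are hypothetical, and exact API spellings vary across cuDF/CuPy versions):

```python
import cudf
from cupyx.scipy import sparse as cpx_sparse

# Hypothetical parquet file holding (row, col, value) triplets
coo_df = cudf.read_parquet("expression_coo.parquet")

# cuDF Series -> CuPy arrays (zero-copy on recent versions)
rows = coo_df["row"].values
cols = coo_df["col"].values
vals = coo_df["value"].values

# Build a CuPy sparse matrix without ever materializing the dense form
mat = cpx_sparse.coo_matrix(
    (vals, (rows, cols)),
    shape=(int(rows.max()) + 1, int(cols.max()) + 1),
).tocsr()
```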
While I do see the value (and there is some interest above), I don't think libcudf is going to add support for a non-Arrow type anytime soon. For us to support this, sparse columns would first need to be added to the Arrow spec.

In principle we could try to implement this at just the Python layer, but that would be quite challenging without a fairly thorough refactoring of our internals to better support a non-injective mapping from Python types to C++ types. That work is somewhere on the radar to improve our handling of categoricals, but it is quite far out, and even then we would need a custom type specialized for this purpose built on top of that machinery.

While I think it is doable, I consider it out of scope, so I'm going to close this issue. If there is sufficient interest in the future, I would be open to reconsidering, but we would need to expend substantial development effort to make this performant enough to be useful, and I have a hard time seeing that happening.
Is your feature request related to a problem? Please describe.
I have sparse data (>95% sparse single-cell gene expression data) stored in a parquet file, and I would like to load it directly into a sparse cuDF DataFrame. Making the DataFrame dense would cause GPU memory to run out.
Describe the solution you'd like
I would like to be able to generate a sparse gdf from a parquet file and write a parquet file from a sparse gdf (ideally, neither operation would require making the matrix dense). A sketch of the manual round trip I would otherwise be writing is shown below.
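Here is the write half of that manual round trip (a hedged sketch; the filename, shape, and density are made up):

```python
import cudf
import cupy as cp
from cupyx.scipy import sparse as cpx_sparse

# Stand-in for a sparse matrix produced elsewhere in the pipeline
mat = cpx_sparse.random(3000, 30000, density=0.05, format="coo", dtype=cp.float32)

# Persist the three COO component arrays as ordinary parquet columns
coo_df = cudf.DataFrame({"row": mat.row, "col": mat.col, "value": mat.data})
coo_df.to_parquet("expression_coo.parquet")
```

Note that the matrix shape is not preserved in the triplets, so it would have to be carried as side metadata; a native sparse gdf would handle that bookkeeping.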
Describe alternatives you've considered
@kkraus14 suggested using CuPy (part of the Chainer project) to work with sparse data. However, cuDF can only interoperate with its dense arrays.
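For contrast, the dense interop that does work today looks roughly like this (a sketch; newer CuPy spells the conversion cp.from_dlpack):

```python
import cudf
import cupy as cp

gdf = cudf.DataFrame({"a": [1.0, 0.0, 0.0], "b": [0.0, 2.0, 0.0]})

# Zero-copy handoff of the *dense* columns via DLPack
arr = cp.fromDlpack(gdf.to_dlpack())
print(type(arr))  # <class 'cupy.ndarray'>
```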
Additional context
This was discussed on Twitter https://twitter.com/franschrandez/status/1126694232897359873?s=20 with @kkraus14
This Google Colab notebook shows the memory usage differences between a pandas DataFrame and SparseDataFrame, as well as making a gdf using cudf.from_pandas. The data for the notebook can be obtained here https://github.com/ismms-himc/clustergrammer2-notebooks/tree/master/data/pbmc3k_filtered_gene_bc_matrices/hg19
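For anyone who can't open the notebook, the gist of the comparison is the following sketch (the sparse-dtype frame is the part cudf.from_pandas cannot ingest, as far as I can tell):

```python
import cudf
import numpy as np
import pandas as pd

pdf = pd.DataFrame(np.eye(1000))
sparse_pdf = pdf.astype(pd.SparseDtype("float64", 0.0))

print(pdf.memory_usage(deep=True).sum())         # full dense footprint
print(sparse_pdf.memory_usage(deep=True).sum())  # only stored values

gdf = cudf.from_pandas(pdf)  # works for the dense frame
# cudf.from_pandas(sparse_pdf) has no sparse dtype to map onto (assumption)
```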