pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Sparse arrays #1375

Closed mrocklin closed 5 years ago

mrocklin commented 7 years ago

I would like to have an XArray that has scipy.sparse arrays rather than numpy arrays. Is this in scope?

What would need to happen within XArray to support this?
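For context on why this doesn't work out of the box: a scipy.sparse matrix is not an ndarray subclass, so xarray can't simply treat it as one, and it is limited to two dimensions. A quick illustration:

```python
import numpy as np
import scipy.sparse

# A scipy.sparse matrix stores only the nonzero entries.
dense = np.eye(4)
m = scipy.sparse.csr_matrix(dense)

print(m.nnz)                      # 4 nonzeros out of 16 cells
print(isinstance(m, np.ndarray))  # False: not an ndarray subclass,
                                  # so it can't duck-type as one
print(m.shape)                    # (4, 4); scipy.sparse is 2-D only
```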

shoyer commented 7 years ago

Yes, I would say this is in scope, as long as we can keep most of the data-type specific logic out of xarray's core (which seems doable).

Currently, we define most of our operations on duck arrays in https://github.com/pydata/xarray/blob/master/xarray/core/duck_array_ops.py

There are a few other hacks throughout the codebase, which you can find by searching for "dask_array_type": https://github.com/pydata/xarray/search?p=1&q=dask_array_type&type=&utf8=%E2%9C%93

It's pretty crude, but basically this would need to be extended to implement many of these methods for sparse arrays, too. Ideally we would move xarray's adapter logic into more cleanly separated submodules, perhaps using multiple dispatch. Even better, we would make this public API, so you could write something like xarray.register_data_type(MySparseArray) to register a type as valid for xarray's .data attribute.
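The `xarray.register_data_type` API mentioned above is hypothetical. A minimal sketch of the kind of type-based registry it implies, similar in spirit to what `duck_array_ops` does internally, might look like this (all names here are invented for illustration):

```python
import numpy as np

# Hypothetical registry mapping array types to implementations of one
# operation; a real version would cover the whole duck-array API.
_SUM_IMPLS = {}

def register_sum(array_type, impl):
    _SUM_IMPLS[array_type] = impl

def duck_sum(arr):
    # Walk the MRO so subclasses pick up their parent's registration.
    for cls in type(arr).__mro__:
        if cls in _SUM_IMPLS:
            return _SUM_IMPLS[cls](arr)
    raise TypeError(f"no sum implementation registered for {type(arr)!r}")

register_sum(np.ndarray, np.sum)

print(duck_sum(np.arange(5)))  # 10
```

A sparse library would then call `register_sum(MySparseArray, MySparseArray.sum)` to opt in without xarray's core knowing about it.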

It looks like __array_ufunc__ will finally land in NumPy 1.13, which might make this easier.
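`__array_ufunc__` (NEP 13) lets a non-ndarray class intercept NumPy ufuncs applied to it. A minimal illustration (the `Wrapped` class is invented for this example):

```python
import numpy as np

class Wrapped:
    """Toy duck array that forwards ufuncs to its wrapped ndarray."""
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        if method != "__call__":
            return NotImplemented
        # Unwrap any Wrapped inputs, apply the ufunc, re-wrap the result.
        raw = [x.data if isinstance(x, Wrapped) else x for x in inputs]
        return Wrapped(ufunc(*raw, **kwargs))

w = np.add(Wrapped([1, 2, 3]), 10)  # NumPy dispatches to __array_ufunc__
print(w.data)  # [11 12 13]
```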

See also https://github.com/pydata/xarray/pull/1118

rabernat commented 7 years ago

👍 to the scipy.sparse array suggestion

[While we are discussing supporting other array types, we should keep gpu arrays on the radar]

benbovy commented 7 years ago

Although I don't know much about SciDB, it seems to be another possible application for xarray.register_data_type.

mrocklin commented 7 years ago

Here is a brief attempt at a multi-dimensional sparse array: https://github.com/mrocklin/sparse

It depends on numpy and scipy.sparse and, with the exception of a bit of in-memory data movement and copies, should run at scipy speeds (though I haven't done any benchmarking).
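The library above provides an n-dimensional COO format; if it isn't installed, the core idea can be sketched with plain NumPy. This `NDCoo` class is an illustration of the concept, not the library's actual API:

```python
import numpy as np

class NDCoo:
    """Minimal n-D COO sketch: store coordinates and values of nonzeros."""
    def __init__(self, dense):
        dense = np.asarray(dense)
        self.shape = dense.shape
        self.coords = np.argwhere(dense)   # (nnz, ndim) integer indices
        self.data = dense[dense != 0]      # (nnz,) nonzero values

    def todense(self):
        out = np.zeros(self.shape, dtype=self.data.dtype)
        out[tuple(self.coords.T)] = self.data
        return out

x = np.zeros((2, 3, 4))
x[0, 1, 2] = 5.0
x[1, 2, 3] = 7.0
s = NDCoo(x)
print(s.data.size)                     # 2 stored values...
print(np.array_equal(s.todense(), x))  # ...that round-trip exactly
```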

@rabernat do you have an application that we could use to drive this?

rabernat commented 7 years ago

@rabernat do you have an application that we could use to drive this?

Nothing comes to mind immediately. My data are unfortunately quite dense! 😜

olgabot commented 7 years ago

In case you're still looking for an application, gene expression from single cells (see data/00_original/GSM162679$i_P14Retina_$j.digital_expression.txt.gz) is very sparse due to high gene dropout. The shape is (49300, 24760) and it's mostly zeros or NaNs. A plain CSV of this data was 2.5 GB, which gzipped to 300 MB.
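Back-of-envelope arithmetic shows why a dense representation of that matrix hurts (the 1% density figure below is an assumption for illustration, not from the dataset):

```python
# Dense float64 storage for a 49300 x 24760 expression matrix:
rows, cols = 49_300, 24_760
dense_bytes = rows * cols * 8      # 8 bytes per float64
print(dense_bytes / 1e9)           # ~9.8 GB

# COO storage at an assumed 1% density: one float64 value plus two
# int64 coordinates per nonzero entry.
nnz = int(rows * cols * 0.01)
coo_bytes = nnz * (8 + 2 * 8)
print(coo_bytes / 1e9)             # ~0.29 GB
```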

Here is an example of using xarray to combine these files but my kernel keeps dying when I do ds.to_netcdf() :(

Hope this is a good example for sparse arrays!

rth commented 6 years ago

do you have an application that we could use to drive this?

Other examples where labeled sparse arrays would be useful:

rabernat commented 6 years ago

Sparse Xarray DataArrays would be useful for the linear regridding operations discussed in JiaweiZhuang/xESMF#3.
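Linear regridding reduces to multiplying the flattened source field by a precomputed sparse weight matrix, which is why sparse support matters there. A toy sketch with scipy.sparse (the weights here are made up; real ones come from ESMF/xESMF):

```python
import numpy as np
import scipy.sparse

# Source grid: 4 cells; destination grid: 2 cells, each averaging two
# source cells. Real regridding weights come from ESMF; these are toy.
weights = scipy.sparse.csr_matrix(
    np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]])
)
src = np.array([1.0, 3.0, 5.0, 7.0])
dst = weights @ src  # sparse matrix-vector product does the regrid
print(dst)  # [2. 6.]
```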

lbybee commented 6 years ago

I'm interested to see if there have been any developments on this. I currently have an application where I'm working with multiple dask arrays, some of which are sparse (text data). It'd be worth my time to move my project to xarray, so I'd be interested in contributing something here if there is a need.

Hoeze commented 6 years ago

I know of a project that could make perfect use of xarray if it supported sparse tensors: https://github.com/theislab/anndata

Currently I have to work with both xarray and anndata to store counts in sparse arrays separately from the other dependent data, which is a little bit annoying :)

shoyer commented 6 years ago

See also: https://github.com/pydata/xarray/issues/1938

The major challenge now is the dispatching mechanism, which hopefully http://www.numpy.org/neps/nep-0018-array-function-protocol.html will solve.

Hoeze commented 6 years ago

Would it be an option to use dask's sparse support? http://dask.pydata.org/en/latest/array-sparse.html This way xarray could let dask do the main work.

Currently I load everything into a dask array by hand and pass this dask array to xarray. This works pretty well.

Hoeze commented 6 years ago

How should these sparse arrays get stored in NetCDF4? I know that NetCDF4 has some conventions for how to store sparse data, but do we have to implement our own conversion mechanisms for each sparse type?
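One simple convention would be to store the COO components (coordinates, values, and original shape) as ordinary dense variables plus metadata. A sketch of that round trip using a plain NumPy `.npz` file as a stand-in for NetCDF variables:

```python
import os
import tempfile
import numpy as np

x = np.zeros((3, 4))
x[0, 1] = 2.5
x[2, 3] = -1.0

# Decompose into COO components: the pieces a file-format convention
# would store as separate variables/attributes.
coords = np.argwhere(x)   # (nnz, ndim) indices
data = x[x != 0]          # (nnz,) values
shape = np.array(x.shape)

path = os.path.join(tempfile.mkdtemp(), "sparse.npz")
np.savez(path, coords=coords, data=data, shape=shape)

# Reconstruct the dense array from the stored components.
with np.load(path) as f:
    out = np.zeros(tuple(f["shape"]), dtype=f["data"].dtype)
    out[tuple(f["coords"].T)] = f["data"]

print(np.array_equal(out, x))  # True
```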

shoyer commented 6 years ago

Would it be an option to use dask's sparse support? http://dask.pydata.org/en/latest/array-sparse.html This way xarray could let dask do the main work.

In principle this would work, though I would prefer to support it directly in xarray, too.

I know that NetCDF4 has some conventions how to store sparse data, but do we have to implement our own conversion mechanisms for each sparse type?

Yes, we would need to implement a convention for handling sparse array data.

rabernat commented 5 years ago

Given the recent improvements in numpy duck array typing, how close are we to being able to just wrap a pydata/sparse array in an xarray Dataset?

shoyer commented 5 years ago

It will need some experimentation, but I think things should be pretty close after NumPy 1.17 is released. Potentially it could be as easy as adjusting the rules xarray uses for casting in xarray.core.variable.as_compatible_data.
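A hypothetical sketch of such an adjusted casting rule: let anything implementing NEP 18's `__array_function__` protocol (which pydata/sparse arrays do) pass through untouched, and coerce everything else. This is an illustration, not xarray's actual implementation:

```python
import numpy as np

def as_compatible_data(data):
    # Sketch only: real xarray logic handles many more cases. Anything
    # exposing the NEP 18 protocol is kept as a duck array; everything
    # else (lists, scalars, ...) is coerced to an ndarray.
    if hasattr(data, "__array_function__"):
        return data
    return np.asarray(data)

a = np.arange(3)
print(as_compatible_data(a) is a)    # ndarray passes through unchanged
print(as_compatible_data([1, 2, 3])) # a list is coerced to an ndarray
```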

rabernat commented 5 years ago

If someone who is good at numpy shows up at our sprint tomorrow, this could be a good issue to try out.

mrocklin commented 5 years ago

@rgommers might be able to recommend someone

rgommers commented 5 years ago

I haven't talked to anyone at SciPy'19 yet who was interested in sparse arrays, but I'll keep an eye out today.

And yes, this is a fun issue to work on and would be really nice to have!

rabernat commented 5 years ago

I personally use the new sparse project for my day-to-day research. I am motivated on this one, but I probably won't have time today to dive deep.

Maybe CuPy would be more exciting.

mrocklin commented 5 years ago

@nvictus has been working on this at #3117

fjanoos commented 5 years ago

Wondering what the status of this is? Is there a branch with this functionality implemented? Would love to give it a spin!

shoyer commented 5 years ago

This is working now on the master branch!

Once we get a few more kinks worked out, it will be in the next release.

I've started another issue for discussing how xarray could integrate sparse arrays better into its API: https://github.com/pydata/xarray/issues/3213

fjanoos commented 4 years ago

@shoyer Is there documentation for using sparse arrays? Could you point me to some example code?

dcherian commented 4 years ago

@fjanoos there isn't any formal documentation yet but you can look at test_sparse.py for examples. That file will also tell you what works and doesn't work currently.