pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev

Need documentation on sparse / cupy integration #3484

Open · rabernat opened this issue 5 years ago

rabernat commented 5 years ago

In https://github.com/pydata/xarray/issues/1375#issuecomment-526432439, @fjanoos asked:

Is there documentation for using sparse arrays? Could you point me to some example code?

@dcherian:

there isn't any formal documentation yet but you can look at test_sparse.py for examples. That file will also tell you what works and doesn't work currently.

If we want people to take advantage of this cool new capability, we need to document it! I'm at PyData NYC and want to share something about this, but it's hard to know where to start without docs.

xref https://github.com/pydata/xarray/issues/3245

dcherian commented 5 years ago

maybe @friedrichknuth can contribute some :)

friedrichknuth commented 5 years ago

After reading through the issue tracker and PRs, it looks like sparse arrays can safely be wrapped with xarray thanks to the work done in #3117, but built-in functions are still under development (e.g. #3542). As a user, here is what I am seeing when test-driving sparse:

Sparse gives me a smaller in-memory array

In [1]: import xarray as xr, sparse, sys, numpy as np, dask.array as da

In [2]: x = np.random.random((100, 100, 100))

In [3]: x[x < 0.9] = np.nan

In [4]: s = sparse.COO.from_numpy(x, fill_value=np.nan)

In [5]: sys.getsizeof(s)
Out[5]: 3189592

In [6]: sys.getsizeof(x)
Out[6]: 8000128
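
As a side note, sys.getsizeof reports the Python object, and how much of the underlying buffers it counts can vary between types; for comparing just the data, both numpy arrays and sparse.COO expose an nbytes attribute. Continuing the session above (a rough sketch; exact numbers depend on how many values survive the fill):

# Bytes held by the data buffers themselves.
x.nbytes  # 100*100*100 float64 values = 8_000_000 bytes for the dense array
s.nbytes  # roughly (ndim*8 + 8) bytes per stored (non-fill) element for COO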

Which I can wrap with dask and xarray

In [7]: x = da.from_array(x)

In [8]: s = da.from_array(s)

In [9]: ds_dense = xr.DataArray(x).to_dataset(name='data_variable')

In [10]: ds_sparse = xr.DataArray(s).to_dataset(name='data_variable')

In [11]: ds_dense
Out[11]:
<xarray.Dataset>
Dimensions:        (dim_0: 100, dim_1: 100, dim_2: 100)
Dimensions without coordinates: dim_0, dim_1, dim_2
Data variables:
    data_variable  (dim_0, dim_1, dim_2) float64 dask.array<chunksize=(100, 100, 100), meta=np.ndarray>

In [12]: ds_sparse
Out[12]:
<xarray.Dataset>
Dimensions:        (dim_0: 100, dim_1: 100, dim_2: 100)
Dimensions without coordinates: dim_0, dim_1, dim_2
Data variables:
    data_variable  (dim_0, dim_1, dim_2) float64 dask.array<chunksize=(100, 100, 100), meta=sparse.COO>

However, computation on the sparse array takes longer than on the dense array (which I think is expected...?)

In [13]: %%time
    ...: ds_sparse.mean().compute()
CPU times: user 487 ms, sys: 22.9 ms, total: 510 ms
Wall time: 518 ms
Out[13]:
<xarray.Dataset>
Dimensions:        ()
Data variables:
    data_variable  float64 0.9501

In [14]: %%time
    ...: ds_dense.mean().compute()
CPU times: user 10.9 ms, sys: 3.91 ms, total: 14.8 ms
Wall time: 13.8 ms
Out[14]:
<xarray.Dataset>
Dimensions:        ()
Data variables:
    data_variable  float64 0.9501

And writing to netCDF, to take advantage of the smaller data size, doesn't work out of the box (yet)

In [15]: ds_sparse.to_netcdf('ds_sparse.nc')
Out[15]: ...
RuntimeError: Cannot convert a sparse array to dense automatically. To manually densify, use the todense method.
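
One possible workaround for writing, sketched here under the assumption that the densified chunks fit in memory (which gives up the size advantage for the write itself), is to convert each dask-wrapped sparse chunk back to a dense numpy block first:

# Densify each sparse.COO chunk before writing (sketch; assumes the
# densified chunks fit in memory).
dense = ds_sparse['data_variable'].data.map_blocks(
    lambda block: block.todense(), dtype=float
)
ds_sparse.copy(data={'data_variable': dense}).to_netcdf('ds_sparse.nc')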

Additional discussion happening at #3213

@dcherian @shoyer Am I missing any built-in methods that are working and ready for public release? Happy to send in a PR if any of what is provided here should go into a basic example for the docs.

At this stage I am not using sparse arrays for my own research just yet, but when I get to that phase I can dig in more and hopefully send in some useful PRs for improved documentation and fixes/features.

k-a-mendoza commented 5 years ago

@friedrichknuth One of my motivations for exploring sparse DataArray backends is reducing the memory footprint during merge operations. Consider the following:
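
A minimal illustration of the pattern (coordinate names and sizes are made up for this sketch):

import numpy as np
import xarray as xr

# Two records on disjoint station and time coordinates.
a = xr.Dataset(
    {"trace": (("station", "time"), np.random.random((10, 1000)))},
    coords={"station": np.arange(10), "time": np.arange(1000)},
)
b = xr.Dataset(
    {"trace": (("station", "time"), np.random.random((10, 1000)))},
    coords={"station": np.arange(10, 20), "time": np.arange(1000, 2000)},
)

# The merged result spans the union of both coordinate sets, so half of
# the dense 20 x 2000 array is just NaN fill.
merged = xr.merge([a, b])
print(merged["trace"].shape)  # (20, 2000)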

One can imagine many such merge operations producing a lot of effectively empty indices. While sparse-backed arrays might be able to condense these empty indices in memory, it seems like xarray's sparse merging isn't quite compatible yet.

mazzma12 commented 4 years ago

Hello, do you have any documentation on how to plot data in a sparse array using the xarray.plot accessor? I get this error, but if I convert to numpy/scipy with the todense() method I will likely lose the convenient plot method from xarray... Thank you for your help

-------------------------------------------------------------------------
RuntimeError                            Traceback (most recent call last)
<ipython-input-331-9d69abf57c11> in <module>
----> 1 slice_res_ds['value'].plot()

~/.pyenv/versions/emi/lib/python3.6/site-packages/xarray/plot/plot.py in __call__(self, **kwargs)
    463 
    464     def __call__(self, **kwargs):
--> 465         return plot(self._da, **kwargs)
    466 
    467     @functools.wraps(hist)

~/.pyenv/versions/emi/lib/python3.6/site-packages/xarray/plot/plot.py in plot(darray, row, col, col_wrap, ax, hue, rtol, subplot_kws, **kwargs)
    200     kwargs["ax"] = ax
    201 
--> 202     return plotfunc(darray, **kwargs)
    203 
    204 

~/.pyenv/versions/emi/lib/python3.6/site-packages/xarray/plot/plot.py in newplotfunc(darray, x, y, figsize, size, aspect, ax, row, col, col_wrap, xincrease, yincrease, add_colorbar, add_labels, vmin, vmax, cmap, center, robust, extend, levels, infer_intervals, colors, subplot_kws, cbar_ax, cbar_kwargs, xscale, yscale, xticks, yticks, xlim, ylim, norm, **kwargs)
    692 
    693         # Pass the data as a masked ndarray too
--> 694         zval = darray.to_masked_array(copy=False)
    695 
    696         # Replace pd.Intervals if contained in xval or yval.

~/.pyenv/versions/emi/lib/python3.6/site-packages/xarray/core/dataarray.py in to_masked_array(self, copy)
   2301             Masked where invalid values (nan or inf) occur.
   2302         """
-> 2303         values = self.values  # only compute lazy arrays once
   2304         isnull = pd.isnull(values)
   2305         return np.ma.MaskedArray(data=values, mask=isnull, copy=copy)

~/.pyenv/versions/emi/lib/python3.6/site-packages/xarray/core/dataarray.py in values(self)
    565     def values(self) -> np.ndarray:
    566         """The array's data as a numpy.ndarray"""
--> 567         return self.variable.values
    568 
    569     @values.setter

~/.pyenv/versions/emi/lib/python3.6/site-packages/xarray/core/variable.py in values(self)
    446     def values(self):
    447         """The variable's data as a numpy.ndarray"""
--> 448         return _as_array_or_item(self._data)
    449 
    450     @values.setter

~/.pyenv/versions/emi/lib/python3.6/site-packages/xarray/core/variable.py in _as_array_or_item(data)
    252     TODO: remove this (replace with np.asarray) once these issues are fixed
    253     """
--> 254     data = np.asarray(data)
    255     if data.ndim == 0:
    256         if data.dtype.kind == "M":

~/.pyenv/versions/emi/lib/python3.6/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

~/.pyenv/versions/emi/lib/python3.6/site-packages/sparse/_sparse_array.py in __array__(self, **kwargs)
    221         if not AUTO_DENSIFY:
    222             raise RuntimeError(
--> 223                 "Cannot convert a sparse array to dense automatically. "
    224                 "To manually densify, use the todense method."
    225             )

RuntimeError: Cannot convert a sparse array to dense automatically. To manually densify, use the todense method.

dcherian commented 4 years ago

da.copy(data=da.data.todense()).plot() should work.

We should add as_sparse and to_dense methods. See discussion here: https://github.com/pydata/xarray/issues/3245. A PR would be very welcome if you have the time.
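
In the meantime, a small helper along these lines can cover both a plain sparse payload and a dask array of sparse chunks (a sketch only; to_dense_da is a made-up name, not part of the xarray API):

import dask.array
import sparse
import xarray as xr

def to_dense_da(arr: xr.DataArray) -> xr.DataArray:
    # Return a copy of ``arr`` backed by dense numpy data, densifying either
    # a plain sparse payload or the sparse chunks inside a dask array.
    data = arr.data
    if isinstance(data, sparse.SparseArray):
        data = data.todense()
    elif isinstance(data, dask.array.Array):
        data = data.map_blocks(
            lambda b: b.todense() if isinstance(b, sparse.SparseArray) else b,
            dtype=arr.dtype,
        )
    return arr.copy(data=data)

# e.g. to_dense_da(slice_res_ds['value']).plot()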

mazzma12 commented 4 years ago

da.copy(data=da.data.todense()).plot() should work.

It works indeed, thank you!