Open rabernat opened 5 years ago
maybe @friedrichknuth can contribute some :)
After reading through the issue tracker and PRs, it looks like sparse arrays can safely be wrapped with xarray, thanks to the work done in PR#3117, but built-in functions are still under development (e.g. PR#3542). As a user, here is what I am seeing when test driving sparse:
Sparse gives me a smaller in-memory array
In [1]: import xarray as xr, sparse, sys, numpy as np, dask.array as da
In [2]: x = np.random.random((100, 100, 100))
In [3]: x[x < 0.9] = np.nan
In [4]: s = sparse.COO.from_numpy(x, fill_value=np.nan)
In [5]: sys.getsizeof(s)
Out[5]: 3189592
In [6]: sys.getsizeof(x)
Out[6]: 8000128
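The size difference above can be roughly explained with a back-of-the-envelope estimate. The sketch below uses only NumPy and assumes that a COO array stores, for every non-fill entry, one int64 index per dimension plus the value itself; the actual layout and index dtype depend on the installed sparse version, so treat the numbers as an approximation rather than a description of the library's internals.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((100, 100, 100))
x[x < 0.9] = np.nan  # roughly 10% of the entries remain

# Entries that a COO array with fill_value=nan would need to store.
nnz = int(np.count_nonzero(~np.isnan(x)))

# Assumption: one int64 index per dimension plus one float64 value per entry.
bytes_per_entry = x.ndim * np.dtype(np.int64).itemsize + x.dtype.itemsize
coo_bytes = nnz * bytes_per_entry
dense_bytes = x.nbytes

# Density above which the COO layout would be larger than the dense array.
break_even_density = x.dtype.itemsize / bytes_per_entry

print(f"stored entries: {nnz}")
print(f"estimated COO size: {coo_bytes} bytes, dense size: {dense_bytes} bytes")
print(f"break-even density: {break_even_density:.2f}")
```

Under this assumption each stored entry costs 32 bytes, which is consistent with the roughly 3.2 MB reported above for about 10% density, and it suggests that a 3-D float64 array stops saving memory once more than about a quarter of its entries are non-fill.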
Which I can wrap with dask and xarray
In [7]: x = da.from_array(x)
In [8]: s = da.from_array(s)
In [9]: ds_dense = xr.DataArray(x).to_dataset(name='data_variable')
In [10]: ds_sparse = xr.DataArray(s).to_dataset(name='data_variable')
In [11]: ds_dense
Out[11]:
<xarray.Dataset>
Dimensions: (dim_0: 100, dim_1: 100, dim_2: 100)
Dimensions without coordinates: dim_0, dim_1, dim_2
Data variables:
data_variable (dim_0, dim_1, dim_2) float64 dask.array<chunksize=(100, 100, 100), meta=np.ndarray>
In [12]: ds_sparse
Out[12]:
<xarray.Dataset>
Dimensions: (dim_0: 100, dim_1: 100, dim_2: 100)
Dimensions without coordinates: dim_0, dim_1, dim_2
Data variables:
data_variable (dim_0, dim_1, dim_2) float64 dask.array<chunksize=(100, 100, 100), meta=sparse.COO>
However, computation on a sparse array takes longer than running compute on a dense array (which I think is expected...?)
In [13]: %%time
...: ds_sparse.mean().compute()
CPU times: user 487 ms, sys: 22.9 ms, total: 510 ms
Wall time: 518 ms
Out[13]:
<xarray.Dataset>
Dimensions: ()
Data variables:
data_variable float64 0.9501
In [14]: %%time
...: ds_dense.mean().compute()
CPU times: user 10.9 ms, sys: 3.91 ms, total: 14.8 ms
Wall time: 13.8 ms
Out[14]:
<xarray.Dataset>
Dimensions: ()
Data variables:
data_variable float64 0.9501
And writing to netcdf, to take advantage of the smaller data size, doesn't work out of the box (yet)
In [15]: ds_sparse.to_netcdf('ds_sparse.nc')
Out[15]: ...
RuntimeError: Cannot convert a sparse array to dense automatically. To manually densify, use the todense method.
Additional discussion happening at #3213
@dcherian @shoyer Am I missing any built-in methods that are working and ready for public release? Happy to send in a PR, if any of what is provided here should go into a basic example for the docs.
At this stage, I am not using sparse arrays for my own research just yet, but when I get to that anticipated phase I can dig in more on this and hopefully send in some useful PRs for improved documentation and fixes/features.
@friedrichknuth One of my motivations behind exploring sparse DataArray backends is in reducing the memory footprint during merge operations. Consider the following:
One can imagine many such merge operations producing a lot of effectively empty indices. While sparse-backed arrays might be able to condense these empty indices in memory, it seems that merging sparse-backed xarray objects isn't quite supported yet.
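To make the concern concrete, here is a small dense illustration (not the example from the original comment, which is not shown here). It assumes two made-up variables `a` and `b` on disjoint coordinates; an outer-join merge expands both onto the union of the coordinates and pads the gaps with NaN.

```python
import numpy as np
import xarray as xr

# Two hypothetical variables defined on non-overlapping coordinates.
a = xr.DataArray(
    np.ones((2, 2)), dims=("x", "y"),
    coords={"x": [0, 1], "y": [0, 1]}, name="a",
)
b = xr.DataArray(
    np.ones((2, 2)), dims=("x", "y"),
    coords={"x": [10, 11], "y": [10, 11]}, name="b",
)

# The outer join keeps the union of coordinates; missing cells become NaN.
merged = xr.merge([a, b], join="outer")

nan_fraction = float(merged["a"].isnull().mean())
print(merged["a"].shape)
print(nan_fraction)
```

Each variable grows from 4 to 16 cells, of which only 4 hold data. With many such merges the padded fraction grows quickly, which is the situation where a sparse backend would be expected to help.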
Hello, do you have any documentation on how to plot data in a sparse array using the xarray.plot accessor?
I get this error, but if I convert to numpy/scipy with the todense() method I will likely lose the convenient plot method from xarray... Thank you for your help
-------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-331-9d69abf57c11> in <module>
----> 1 slice_res_ds['value'].plot()
~/.pyenv/versions/emi/lib/python3.6/site-packages/xarray/plot/plot.py in __call__(self, **kwargs)
463
464 def __call__(self, **kwargs):
--> 465 return plot(self._da, **kwargs)
466
467 @functools.wraps(hist)
~/.pyenv/versions/emi/lib/python3.6/site-packages/xarray/plot/plot.py in plot(darray, row, col, col_wrap, ax, hue, rtol, subplot_kws, **kwargs)
200 kwargs["ax"] = ax
201
--> 202 return plotfunc(darray, **kwargs)
203
204
~/.pyenv/versions/emi/lib/python3.6/site-packages/xarray/plot/plot.py in newplotfunc(darray, x, y, figsize, size, aspect, ax, row, col, col_wrap, xincrease, yincrease, add_colorbar, add_labels, vmin, vmax, cmap, center, robust, extend, levels, infer_intervals, colors, subplot_kws, cbar_ax, cbar_kwargs, xscale, yscale, xticks, yticks, xlim, ylim, norm, **kwargs)
692
693 # Pass the data as a masked ndarray too
--> 694 zval = darray.to_masked_array(copy=False)
695
696 # Replace pd.Intervals if contained in xval or yval.
~/.pyenv/versions/emi/lib/python3.6/site-packages/xarray/core/dataarray.py in to_masked_array(self, copy)
2301 Masked where invalid values (nan or inf) occur.
2302 """
-> 2303 values = self.values # only compute lazy arrays once
2304 isnull = pd.isnull(values)
2305 return np.ma.MaskedArray(data=values, mask=isnull, copy=copy)
~/.pyenv/versions/emi/lib/python3.6/site-packages/xarray/core/dataarray.py in values(self)
565 def values(self) -> np.ndarray:
566 """The array's data as a numpy.ndarray"""
--> 567 return self.variable.values
568
569 @values.setter
~/.pyenv/versions/emi/lib/python3.6/site-packages/xarray/core/variable.py in values(self)
446 def values(self):
447 """The variable's data as a numpy.ndarray"""
--> 448 return _as_array_or_item(self._data)
449
450 @values.setter
~/.pyenv/versions/emi/lib/python3.6/site-packages/xarray/core/variable.py in _as_array_or_item(data)
252 TODO: remove this (replace with np.asarray) once these issues are fixed
253 """
--> 254 data = np.asarray(data)
255 if data.ndim == 0:
256 if data.dtype.kind == "M":
~/.pyenv/versions/emi/lib/python3.6/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
83
84 """
---> 85 return array(a, dtype, copy=False, order=order)
86
87
~/.pyenv/versions/emi/lib/python3.6/site-packages/sparse/_sparse_array.py in __array__(self, **kwargs)
221 if not AUTO_DENSIFY:
222 raise RuntimeError(
--> 223 "Cannot convert a sparse array to dense automatically. "
224 "To manually densify, use the todense method."
225 )
RuntimeError: Cannot convert a sparse array to dense automatically. To manually densify, use the todense method.
da.copy(data=da.data.todense()).plot() should work.
We should add as_sparse and to_dense methods. See discussion here: https://github.com/pydata/xarray/issues/3245. A PR would be very welcome if you have the time.
da.copy(data=da.data.todense()).plot() should work.
It works indeed, thank you!
In https://github.com/pydata/xarray/issues/1375#issuecomment-526432439, @fjanoos asked:
@dcherian:
If we want people to take advantage of this cool new capability, we need to document it! I'm at pydata NYC and want to share something about this, but it's hard to know where to start without docs.
xref https://github.com/pydata/xarray/issues/3245