holoviz / datashader

Quickly and accurately render even the largest data.
http://datashader.org
BSD 3-Clause "New" or "Revised" License

ndarray TypeError when visualising Dask Array #970

Closed peterroelants closed 3 years ago

peterroelants commented 3 years ago

When trying to visualize a Dask array with raster and shade, I get a TypeError: data must be an ndarray error in the eq_hist function. From the Datashader documentation I expected Dask arrays to be supported in Datashader.

I have provided a minimal notebook at https://gist.github.com/peterroelants/1d77e09bd05cc55c240bc11983e2a0c4 to reproduce the error.

ALL software version info

Python implementation: CPython
Python version       : 3.8.6
IPython version      : 7.19.0

Compiler    : GCC 7.5.0
OS          : Linux
Release     : 5.4.0-54-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 4
Architecture: 64bit

datashader: 0.11.1
sys       : 3.8.6 | packaged by conda-forge | (default, Oct  7 2020, 19:08:05) 
[GCC 7.5.0]
numpy     : 1.19.4
xarray    : 0.16.1
dask      : 2.30.0

Description of expected behavior and the observed behavior

I expect shade to be able to visualise a Dask array.

Complete, minimal, self-contained example code that reproduces the issue

https://gist.github.com/peterroelants/1d77e09bd05cc55c240bc11983e2a0c4

Stack traceback and/or browser JavaScript console output

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-a01bef0eda6c> in <module>
      1 canvas = datashader.Canvas(plot_width=900, plot_height=400)
      2 datashader.transfer_functions.Images(
----> 3     datashader.transfer_functions.shade(
      4         canvas.raster(data_da, agg=datashader.reductions.mean('z'))
      5     )

~/miniconda3/envs/msi/lib/python3.8/site-packages/datashader/transfer_functions/__init__.py in shade(agg, cmap, color_key, how, alpha, min_alpha, span, name, color_baseline)
    509 
    510     if agg.ndim == 2:
--> 511         return _interpolate(agg, cmap, how, alpha, span, min_alpha, name)
    512     elif agg.ndim == 3:
    513         return _colorize(agg, color_key, how, alpha, span, min_alpha, name, color_baseline)

~/miniconda3/envs/msi/lib/python3.8/site-packages/datashader/transfer_functions/__init__.py in _interpolate(agg, cmap, how, alpha, span, min_alpha, name)
    240     with np.errstate(invalid="ignore", divide="ignore"):
    241         # Transform data (log, eq_hist, etc.)
--> 242         data = interpolater(data, mask)
    243 
    244         # Transform span

~/miniconda3/envs/msi/lib/python3.8/site-packages/datashader/transfer_functions/__init__.py in eq_hist(data, mask, nbins)
    163         from ._cuda_utils import interp
    164     elif not isinstance(data, np.ndarray):
--> 165         raise TypeError("data must be an ndarray")
    166     else:
    167         interp = np.interp

TypeError: data must be an ndarray

Where data is a dask.array.core.Array.

jbednar commented 3 years ago

Datashader supports Dask DataFrames, not Dask Arrays; for n-dimensional arrays use an Xarray DataArray instead. I don't think it would be difficult to support a Dask Array, but for all the use cases we've contemplated a DataArray (backed by Dask) is strictly a superset of a Dask Array, and should have the same performance.

peterroelants commented 3 years ago

I don't think I have been clear enough. I'm using an Xarray DataArray, and that DataArray happens to have a Dask array as its data. This combination is claimed to be supported by https://datashader.org/user_guide/Performance.html (Xarray+DaskArray).

I'm getting the error on a Dask Array since utils.orient_array extracts the Dask data array via .data from the xarray raster. This Dask array is then further passed to eq_hist.
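The mechanism described above can be checked without Datashader. The following is a minimal sketch, assuming only that numpy, xarray, and dask are installed; the array shape, chunking, and the names lazy and raster are made up for illustration and are not taken from the gist.

```python
import dask.array as da
import numpy as np
import xarray as xr

# A small lazy array standing in for the raster aggregate (illustrative shape).
lazy = da.zeros((4, 6), chunks=(2, 3))
raster = xr.DataArray(lazy, dims=("y", "x"), name="z")

# .data hands back the underlying Dask array, not a NumPy array, so an
# isinstance(data, np.ndarray) check like the one in eq_hist rejects it.
data = raster.data
print(type(data).__name__)
print(isinstance(data, np.ndarray))

# After .compute() the same DataArray is backed by an in-memory NumPy array.
print(isinstance(raster.compute().data, np.ndarray))
```

The two isinstance checks show the difference between the lazy and the computed DataArray, which is why computing before shade() avoids the TypeError.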

I think I can fix this issue by forcing a compute() when the data object is a dask array in transfer_functions._interpolate just after orient_array returns. I can try a PR if you think this would be a good first stab at the issue?
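As a rough sketch of the idea proposed above (not the code that ended up in the PR), a guard of this kind could run right after the data array is extracted. The helper name materialize is hypothetical, and the sketch deliberately ignores the GPU branch visible in the traceback.

```python
import numpy as np


def materialize(data):
    """Return `data` as an in-memory NumPy array.

    Hypothetical helper for illustration only; not Datashader's actual code.
    """
    if isinstance(data, np.ndarray):
        # Already in memory: pass it through unchanged.
        return data
    compute = getattr(data, "compute", None)
    if callable(compute):
        # Dask arrays expose .compute(), which evaluates the task graph
        # and returns an in-memory result.
        data = compute()
    return np.asarray(data)
```

Duck typing on .compute() keeps the sketch free of a hard Dask import; an explicit isinstance check against dask.array.Array would be the stricter alternative.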

jbednar commented 3 years ago

Ah, I see. Looking at your gist, yes, calling .compute() before shade() will fix it:

cvs = datashader.Canvas(plot_width=900, plot_height=400)
agg = cvs.raster(data_da, agg=datashader.reductions.mean('z'))
img = datashader.transfer_functions.shade(agg.compute())
img


So yes, it would be great to see a PR to shade() to call .compute() first if it sees a Dask Array-backed DataArray. In the meantime, just call .compute() before calling shade().

peterroelants commented 3 years ago

I tried a first stab at fixing this issue: https://github.com/holoviz/datashader/pull/971

Would love some help getting this in if you think it's ok.

jbednar commented 3 years ago

Thanks for the fix!

peterroelants commented 3 years ago

Thanks for merging this in!

I might also have a look at QuadMesh soon; last time I checked, it tried to load my whole Dask array into memory. Are you aware of any issues there that I could look into?

jbednar commented 3 years ago

QuadMesh should support Dask-backed Xarray quadmeshes properly since version 0.11.0 (see https://github.com/holoviz/datashader/pull/885), and I don't know of any regressions introduced in 0.11.1 or in master. So the first step would be to make a reproducible example of any bug or problem, and we can go from there. Thanks!

peterroelants commented 3 years ago

> QuadMesh should support Dask-backed Xarray quadmeshes properly since version 0.11.0 (see #885), and I don't know of any regressions introduced in 0.11.1 or in master. So the first step would be to make a reproducible example of any bug or problem, and we can go from there. Thanks!

I created an example of what I meant and filed an issue at https://github.com/holoviz/datashader/issues/972 .