holoviz-topics / examples

Visualization-focused examples using HoloViz for specific topics
https://examples.holoviz.org
Creative Commons Attribution 4.0 International

Add daskified example to bay_trimesh? #100

Open rsignell-usgs opened 4 years ago

rsignell-usgs commented 4 years ago

I see from the table on the datashader performance page that trimesh has support for dask (screenshot attached), and there is also a datashader trimesh notebook with a dask example.

It would be great to add a similar dask example after the last cell in https://github.com/pyviz-topics/examples/blob/master/bay_trimesh/bay_trimesh.ipynb, essentially dask-enabling this call:

datashade(hv.TriMesh((tris, points)), aggregator=ds.mean('z'), precompute=True)

When I try it, I get:

NotImplementedError: Dask dataframe does not support assigning non-scalar value.

My full notebook is here:
https://nbviewer.jupyter.org/gist/rsignell-usgs/cf853d43fd5e53ba90fbd4e0b9eb3da7
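
For concreteness, here is roughly the kind of cell I have in mind, sketched with a tiny synthetic mesh instead of the bay data (the variable names, columns, and partition counts are just illustrative). Per the error above, this is essentially the call that currently fails once the tables are dask-backed:

import numpy as np
import pandas as pd
import dask.dataframe as dd
import datashader as ds
import holoviews as hv
from holoviews.operation.datashader import datashade

# Tiny stand-in for the bay mesh: 4 vertices, 2 triangles
verts = pd.DataFrame({'x': [0.0, 1.0, 1.0, 0.0],
                      'y': [0.0, 0.0, 1.0, 1.0],
                      'z': [0.0, 1.0, 2.0, 3.0]})
tris = pd.DataFrame({'v0': [0, 0], 'v1': [1, 2], 'v2': [2, 3]})

# Partition the vertex and triangle tables into dask DataFrames
verts_ddf = dd.from_pandas(verts, npartitions=1)
tris_ddf = dd.from_pandas(tris, npartitions=1)

# Same shape of call as the last cell of bay_trimesh, but dask-backed;
# this is where the NotImplementedError above shows up.
trimesh = hv.TriMesh((tris_ddf, hv.Points(verts_ddf, vdims=['z'])))
datashade(trimesh, aggregator=ds.mean('z'), precompute=True)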

TomAugspurger commented 4 years ago

It's failing at https://github.com/holoviz/holoviews/blob/4dcddcc3a8a278dea550147def62f46ecd3d5d1d/holoviews/core/data/dask.py#L272-L277, where holoviews is trying to insert a numpy array into a dask dataframe.

df['index'] = np.arange(n)

but that fails for a dask dataframe. We need to wrap the NumPy array in a Dask Array before inserting it. The tricky part is making sure that the chunks of the array match the partitions of the dataframe.
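
A stripped-down way to see the same failure, using a synthetic dask dataframe (this mirrors the report above with the dask version in use there; newer dask releases may handle the assignment differently):

import numpy as np
import dask.datasets

# Synthetic dask DataFrame standing in for the TriMesh data
df = dask.datasets.timeseries(end='2000-01-05')

# This is effectively what holoviews attempts: assigning a plain NumPy
# array as a new column of a dask DataFrame, which triggers the
# NotImplementedError reported above.
df['index'] = np.arange(len(df))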

TomAugspurger commented 4 years ago

Here's the basic idea

In [1]: import dask.dataframe as dd

In [2]: import dask

In [4]: import dask.array as da

In [5]: import numpy as np

In [6]: ts = dask.datasets.timeseries(end='2000-01-05')

In [7]: ts
Out[7]:
Dask DataFrame Structure:
                  id    name        x        y
npartitions=4
2000-01-01     int64  object  float64  float64
2000-01-02       ...     ...      ...      ...
2000-01-03       ...     ...      ...      ...
2000-01-04       ...     ...      ...      ...
2000-01-05       ...     ...      ...      ...
Dask Name: make-timeseries, 4 tasks

In [8]: index = np.arange(len(ts))

In [9]: chunks = tuple(ts.map_partitions(len).compute())

In [10]: ts['index'] = da.from_array(index, chunks)

In [11]: ts
Out[11]:
Dask DataFrame Structure:
                  id    name        x        y  index
npartitions=4
2000-01-01     int64  object  float64  float64  int64
2000-01-02       ...     ...      ...      ...    ...
2000-01-03       ...     ...      ...      ...    ...
2000-01-04       ...     ...      ...      ...    ...
2000-01-05       ...     ...      ...      ...    ...
Dask Name: assign, 21 tasks

The unfortunate part is the chunks = tuple(ts.map_partitions(len).compute()) step, which needs one pass over the data just to learn the partition lengths. I think that's unavoidable, though, since we need the chunks to align. Hopefully that's not too expensive (or maybe holoviews already gets the chunk sizes elsewhere?)
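
For reference, the recipe above packaged as a small helper (add_global_index is just a made-up name for this sketch, not an existing holoviews or dask function):

import numpy as np
import dask.array as da
import dask.datasets

def add_global_index(ddf, name='index'):
    # One pass over the data to learn the partition lengths, so the
    # dask array's chunks line up with the dataframe's partitions.
    chunks = tuple(ddf.map_partitions(len).compute())
    index = np.arange(sum(chunks))
    ddf[name] = da.from_array(index, chunks=(chunks,))
    return ddf

# Same synthetic data as in the session above
ts = add_global_index(dask.datasets.timeseries(end='2000-01-05'))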