holoviz / datashader

Quickly and accurately render even the largest data.
http://datashader.org
BSD 3-Clause "New" or "Revised" License
3.32k stars 365 forks source link

Xarray + DaskArray out-of-memory on QuadMesh where raster works #972

Open peterroelants opened 3 years ago

peterroelants commented 3 years ago

My Python kernel gets killed because of an out-of-memory issue when generating a quadmesh from a large Xarray DataArray using DaskArrays as data. Visualising the same DataArray with raster works using Dask's out-of-core computation support.

When looking at the Dask scheduler I noticed that nothing gets scheduled when creating the quadmesh (I think because the whole DataArray is forced into memory). While using raster shows a nice computation graph and is able to compute the figure out-of-core without any issues.

To illustrate this I created two notebooks trying to render a figure from the same DataArray

  1. Successfully running raster: https://gist.github.com/peterroelants/0624834713a1388c7f57d3cafd9b800b
    • Dask schedules the computations needed to compute the image (See screenshots below).
  2. Failure of running quadmesh: https://gist.github.com/peterroelants/dd5375ed5d58e1dfd72bc2003539124d
    • Dask scheduler stays empty when running this. No tasks are scheduled. Presumable because the whole DataArray is forced into memory before any of the computation happens.

ALL software version info

Python implementation: CPython
Python version       : 3.8.6
IPython version      : 7.19.0

Compiler    : GCC 7.5.0
OS          : Linux
Release     : 5.4.0-54-generic
Machine     : x86_64
Processor   : x86_64
CPU cores   : 4
Architecture: 64bit

sys       : 3.8.6 | packaged by conda-forge | (default, Oct  7 2020, 19:08:05) 
[GCC 7.5.0]
holoviews : 1.13.5
datashader: 0.11.2a5
xarray    : 0.16.1
numpy     : 1.19.4
dask      : 2.30.0

datashader build from master with commit-id fd938888feca3a42bdfb42462d098f758a954dd8

Description of expected behavior and the observed behavior

I would expect that quadmesh does not try to load the whole DataArray in memory and tries to leverage Dask's out-of-core computation infrastructure, similar to how raster does this. I would expect from the documentation at https://datashader.org/user_guide/Performance.html that Xarray + DaskArray is supported.

Complete, minimal, self-contained example code that reproduces the issue

See notebook at: https://gist.github.com/peterroelants/dd5375ed5d58e1dfd72bc2003539124d

Stack traceback and/or browser JavaScript console output

distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 13.04 GB -- Worker memory limit: 10.00 GB

Before Python kernel gets killed because it runs out of memory.

Screenshots of rasterize leveraging Dask's out-of-core:

Screenshots taken when successfully running raster: https://gist.github.com/peterroelants/0624834713a1388c7f57d3cafd9b800b

jbednar commented 3 years ago

Thanks. Could you run the example notebook(s) from https://github.com/holoviz/datashader/pull/885 and see if those work? If they do, can you make your notebook more similar to one of them to see what the difference might be?

BTW, it will be easier to debug if you separate the steps in your pipeline. I.e., instead of:

datashader.transfer_functions.Image(
    datashader.transfer_functions.shade(
        canvas.quadmesh(data_ds, x='x', y='y')
    )
)

do

agg = canvas.quadmesh(data_ds, x='x', y='y')
img = datashader.transfer_functions.shade(agg)
img

Here only the agg step would be distributed, so separating it helps us focus clearly on that part. Also, I'm not sure why you were calling datashader.transfer_functions.Image() on the output of shade(), because shade() output is normally already an Image.

peterroelants commented 3 years ago

Thanks for the references to #885 , I did some tests based on the quadmesh_rectilinear_dask_PR.ipynb from that PR by replacing the matrix with a much larger one. I noticed that it works up until a certain size.

Some general observations:

jbednar commented 3 years ago

Thanks! "and has"?

FYI, the raster and quadmesh code paths in Datashader are entirely independent, for historical reasons. quadmesh support is more limited but is fully written from scratch for the Datashader stack, while the raster code has more interpolation and other features, but was inherited from older code and has only minimally been adapted. So it's surprising that the raster code would be the one with better Dask support. @jonmmease may be able to spot something or suggest something here...

jonmmease commented 3 years ago

I don't have the design fresh in my mind any more, but https://github.com/holoviz/datashader/pull/885 is where Dask support was added.

quadmesh on the full 16384 by 16384 runs out of memory before anything seems to get scheduled on Dask.

Only potential guess here is that something is going wrong with memory usage during auto-range calculations. Do you see the same behavior if you explicitly provide x and y range extents?

peterroelants commented 3 years ago

Only potential guess here is that something is going wrong with memory usage during auto-range calculations. Do you see the same behavior if you explicitly provide x and y range extents?

You mean by creating the canvas with c = ds.Canvas(plot_width=601, plot_height=600, x_range=(xs[0], xs[-1]), y_range=(ys[0], ys[-1]))? I quickly tried this and this did not seem to resolve the issue.

What would be the entrypoints in the codebase to start exploring the differences between how raster and quadmesh perform their computations on Dask?

jbednar commented 3 years ago

Yes, supplying the ranges in that way should have avoided the auto-range calculations, so I think that's not the issue. Seems very mysterious, as if you're hitting some heuristic designed to avoid using too much memory for intermediate values. No idea!

I don't think raster and quadmesh share anything about how they use Dask, but I could be wrong about that; e.g. they may have been written by the same person and could thus share code even though what they are wrapping is entirely different.

jonmmease commented 3 years ago

I don't think raster and quadmesh share anything about how they use Dask

That's correct, it was implemented from scratch inside Datashader's regular architecture. Raster is basically a standalone library with a top-level interface Canvas interface that looks like the other glypyhs, but it doesn't use any of Datashaders aggregation framework.

In terms of the structure. If the xarray dataset has uniformly spaced coordinates and is backed by Dask, you should be falling down this logic path:

https://github.com/holoviz/datashader/blob/a033a2a6c6562f46b5d6ffaddc05a80c6c6b334b/datashader/data_libraries/dask_xarray.py#L136

The core rendering logic is in https://github.com/holoviz/datashader/blob/a033a2a6c6562f46b5d6ffaddc05a80c6c6b334b/datashader/glyphs/quadmesh.py#L341

So first thing to look into is to check if you hit these places in the code, and see if you can work out what operation is triggering the OOM error.

kcpevey commented 3 years ago

It seems like there are ways the code can be improved here but there is no concrete todo item. Adding to the wishlist. If it can be made into a specific request, we can add it to an actual milestone.