Open peterroelants opened 3 years ago
Thanks. Could you run the example notebook(s) from https://github.com/holoviz/datashader/pull/885 and see if those work? If they do, can you make your notebook more similar to one of them to see what the difference might be?
BTW, it will be easier to debug if you separate the steps in your pipeline. I.e., instead of:
```python
datashader.transfer_functions.Image(
    datashader.transfer_functions.shade(
        canvas.quadmesh(data_ds, x='x', y='y')
    )
)
```
do

```python
agg = canvas.quadmesh(data_ds, x='x', y='y')
img = datashader.transfer_functions.shade(agg)
img
```
Here only the `agg` step would be distributed, so separating it helps us focus clearly on that part. Also, I'm not sure why you were calling `datashader.transfer_functions.Image()` on the output of `shade()`, because the output of `shade()` is normally already an `Image`.
Thanks for the reference to #885. I did some tests based on the `quadmesh_rectilinear_dask_PR.ipynb` notebook from that PR, replacing the matrix with a much larger one, and noticed that it works up until a certain size:

- `quadmesh` figure based on a Dask array: https://gist.github.com/peterroelants/8b50a310dfe003aa6ab5370d62d747be
- `quadmesh` fails with a 16384 by 16384 matrix on my machine, killing the Python kernel due to being out of memory: https://gist.github.com/peterroelants/b62cdb06c3e00da1c456833cf83ffc31
- `raster` works with a 16384 by 16384 matrix on my machine: https://gist.github.com/peterroelants/ac5e5e55f5ceac675923e2e63d078836

Some general observations:

- `quadmesh` on the full 16384 by 16384 matrix runs out of memory before anything seems to get scheduled on Dask; `raster` works fine on this matrix.
- `quadmesh` only seems to fetch the data by computing `random.random`. When looking at the scheduler while running `raster`, much more happens on the Dask graph.

This leads me to think that `quadmesh` isn't able to properly leverage Dask's out-of-core computation capabilities for some reason. Thanks!
FYI, the raster and quadmesh code paths in Datashader are entirely independent, for historical reasons. quadmesh support is more limited but is fully written from scratch for the Datashader stack, while the raster code has more interpolation and other features, but was inherited from older code and has only minimally been adapted. So it's surprising that the raster code would be the one with better Dask support. @jonmmease may be able to spot something or suggest something here...
I don't have the design fresh in my mind any more, but https://github.com/holoviz/datashader/pull/885 is where Dask support was added.
> quadmesh on the full 16384 by 16384 runs out of memory before anything seems to get scheduled on Dask.
Only potential guess here is that something is going wrong with memory usage during auto-range calculations. Do you see the same behavior if you explicitly provide x and y range extents?
> Only potential guess here is that something is going wrong with memory usage during auto-range calculations. Do you see the same behavior if you explicitly provide x and y range extents?
You mean by creating the canvas with `c = ds.Canvas(plot_width=601, plot_height=600, x_range=(xs[0], xs[-1]), y_range=(ys[0], ys[-1]))`? I quickly tried this, and it did not seem to resolve the issue.
What would be the entry points in the codebase to start exploring the differences between how `raster` and `quadmesh` perform their computations on Dask?
Yes, supplying the ranges in that way should have avoided the auto-range calculations, so I think that's not the issue. Seems very mysterious, as if you're hitting some heuristic designed to avoid using too much memory for intermediate values. No idea!
I don't think raster and quadmesh share anything about how they use Dask, but I could be wrong about that; e.g. they may have been written by the same person and could thus share code even though what they are wrapping is entirely different.
> I don't think raster and quadmesh share anything about how they use Dask
That's correct; quadmesh was implemented from scratch inside Datashader's regular architecture. Raster is basically a standalone library with a top-level `Canvas` interface that looks like the other glyphs, but it doesn't use any of Datashader's aggregation framework.
In terms of structure: if the xarray dataset has uniformly spaced coordinates and is backed by Dask, you should be falling down this logic path. The core rendering logic is in https://github.com/holoviz/datashader/blob/a033a2a6c6562f46b5d6ffaddc05a80c6c6b334b/datashader/glyphs/quadmesh.py#L341
So the first thing to look into is to check whether you hit these places in the code, and see if you can work out which operation is triggering the OOM error.
It seems like there are ways the code can be improved here, but there is no concrete to-do item, so adding this to the wishlist. If it can be made into a specific request, we can add it to an actual milestone.
My Python kernel gets killed because of an out-of-memory issue when generating a `quadmesh` from a large xarray DataArray using Dask arrays as data. Visualising the same DataArray with `raster` works, using Dask's out-of-core computation support.

When looking at the Dask scheduler I noticed that nothing gets scheduled when creating the `quadmesh` (I think because the whole DataArray is forced into memory), while using `raster` shows a nice computation graph and is able to compute the figure out-of-core without any issues.

To illustrate this I created two notebooks trying to render a figure from the same DataArray:

- `raster`: https://gist.github.com/peterroelants/0624834713a1388c7f57d3cafd9b800b
- `quadmesh`: https://gist.github.com/peterroelants/dd5375ed5d58e1dfd72bc2003539124d

ALL software version info
datashader built from `master` at commit `fd938888feca3a42bdfb42462d098f758a954dd8`
Description of expected behavior and the observed behavior
I would expect that `quadmesh` does not try to load the whole DataArray into memory, and instead leverages Dask's out-of-core computation infrastructure, similar to how `raster` does. Based on the documentation at https://datashader.org/user_guide/Performance.html, I would expect xarray + Dask arrays to be supported.

Complete, minimal, self-contained example code that reproduces the issue
See notebook at: https://gist.github.com/peterroelants/dd5375ed5d58e1dfd72bc2003539124d
Stack traceback and/or browser JavaScript console output
None; the Python kernel gets killed because it runs out of memory.
Screenshots taken when successfully running `raster` (https://gist.github.com/peterroelants/0624834713a1388c7f57d3cafd9b800b), showing `rasterize` leveraging Dask's out-of-core computation:

- `rasterize` tasks (this does not happen with `quadmesh`): [screenshot]
- `rasterize` task graph in the Dask scheduler (this stays at "Scheduler is empty" when running `quadmesh`): [screenshot]