pangeo-data / distributed-array-examples


[Fail case] Almost-blockwise weighted arithmetic vorticity calculation #1

Open · TomNicholas opened this issue 2 years ago

TomNicholas commented 2 years ago

Motivation

@rabernat and I have a huge oceanographic dataset (LLC4320) with surface velocity components `u` and `v`. We want to calculate the vorticity from these using `u*dx - v*dy`, which takes into account the sizes of the grid cells in the x and y directions, `dx` and `dy`.

This is a specific example of a generic pattern in geoscience / fluid dynamics: apply a simple arithmetic operation weighted by some lower-dimensional constants. In our case we are approximating a differential operator, but the same pattern shows up elsewhere, for example in weighted averages.
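
As a rough illustration of the pattern only (the shapes, chunk sizes, and random data below are placeholders, not the real LLC4320 fields):

```python
import dask.array as da

# Placeholder stand-ins for the real fields: 3D velocity components
# with dims (time, y, x) and 2D cell-size weights with dims (y, x).
u = da.random.random((300, 5000, 5000), chunks=(1, 2500, 2500))
v = da.random.random((300, 5000, 5000), chunks=(1, 2500, 2500))
dx = da.random.random((5000, 5000), chunks=(2500, 2500))  # cell size in x
dy = da.random.random((5000, 5000), chunks=(2500, 2500))  # cell size in y

# Simple arithmetic weighted by lower-dimensional constants: the 2D weights
# broadcast against the 3D fields. On its own this is purely blockwise.
vorticity = u * dx - v * dy
```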

Full problem statement

The full problem we were trying to solve has an additional feature: a padding-and-rechunking step in which each variable gets one more element appended to the end of the array and is then immediately rechunked to merge that length-1 chunk back into the final chunk of the array. xGCM does this to deal with boundary conditions, but the only part relevant for this issue is that it is practically a no-op, yet it is enough that the overall graph is technically no longer "blockwise".
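
A minimal sketch of that step in plain dask (the axis, pad mode, and chunk sizes here are assumptions for illustration, not the exact xGCM call):

```python
import dask.array as da

u = da.random.random((300, 5000, 5000), chunks=(1, 2500, 2500))

# Append one element along the last axis ("wrap" is just one plausible
# choice of boundary condition here).
u_padded = da.pad(u, ((0, 0), (0, 0), (0, 1)), mode="wrap")
# chunks along the last axis are now (2500, 2500, 1)

# Immediately merge the stray length-1 chunk back into the final chunk.
# The data is effectively unchanged, but the graph is no longer blockwise.
merged_x_chunks = u.chunks[2][:-1] + (u.chunks[2][-1] + 1,)
u_merged = u_padded.rechunk({2: merged_x_chunks})
```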

What specifically goes wrong

This computation displays a number of failure modes when executed with dask.distributed.

General considerations

I think this problem is a useful benchmark because it exposes multiple failure modes of the dask.distributed scheduler at once, whilst also letting you switch any of these failure modes on or off with slight changes to the problem definition.

Links

Benchmarks

jrbourbeau commented 2 years ago

> Might be a lot better after Gabe's fixes but I haven't tried this yet

We've been running a benchmark that's very similar to your use case, and when Gabe's worker saturation is enabled we've seen the benchmark run ~75% faster while using ~60% less memory. I'd be interested to see what impact this has on your real-world workload. Happy to help out here however I can @TomNicholas
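
For anyone wanting to try the same thing, this is roughly how the scheduler-side queuing ("worker saturation") can be switched on via dask config; the value 1.1 is just a commonly used setting, not something taken from this thread:

```python
import dask
from dask.distributed import Client

# Enable scheduler-side task queuing; the setting must be visible to the
# scheduler, so set it before creating the cluster. It can also be set via
# the DASK_DISTRIBUTED__SCHEDULER__WORKER_SATURATION environment variable.
dask.config.set({"distributed.scheduler.worker-saturation": 1.1})

client = Client()
```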

DrTodd13 commented 2 years ago

The Ramba team has spent some time looking into this example. We didn't previously have the pad function, but we implemented it and got this example working with Ramba. On a 4-node cluster with 36 cores (2 sockets) and 128 GB of memory per node, with a data size of 5000x5000x300, we are seeing about 5s for Ramba and about 65s for Dask. Ramba seems to scale quite linearly, and we can go up to 5000x5000x600 on these machines in main memory without thrashing. So, if your cluster is big enough, then as an in-memory solution Ramba seems to be quite a bit faster than Dask, even if we reduce Dask's time by 75% as in the comment above.

TomNicholas commented 1 year ago

Here's a statement of the full problem (including padding and rechunking) in xarray code, and how cubed performs on it (see the benchmark-vorticity.ipynb notebook). cc @tomwhite
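
For readers who haven't opened the notebook, the xarray-plus-cubed setup looks roughly like the sketch below; the store path and variable names are placeholders, and the `chunked_array_type="cubed"` integration assumes the cubed and cubed-xarray packages are installed:

```python
import xarray as xr

ds = xr.open_dataset(
    "llc4320_subset.zarr",       # hypothetical store; see the notebook for the real data
    engine="zarr",
    chunks={},
    chunked_array_type="cubed",  # back the arrays with cubed instead of dask
)

vorticity = ds["u"] * ds["dx"] - ds["v"] * ds["dy"]  # same weighted-arithmetic pattern
vorticity.compute()
```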


@DrTodd13 sorry for not replying, that sounds cool! I would love to try and run the same problem with ramba inside xarray, and get a 3-way dask vs cubed vs ramba comparison going.