jbusecke opened this issue 3 years ago
I just ran the example again and the cluster started showing activity fairly quickly, so I am not entirely sure the above example is the best one to expose this behavior, but perhaps you still have an idea of what could be causing it?
Thanks for the nice example snippet @jbusecke!
Do you have a hunch as to what changes in the recent version could have exacerbated this issue?
Nothing immediately pops out to me. But there has been a lot of recent work on `HighLevelGraph`s, transmitting graphs to the scheduler, etc. which might have impacted your workflow.
I tried the example locally on my laptop with two changes:

- a `LocalCluster` instead of a `GatewayCluster`
- a local `fsspec` mapper (i.e. `mapper = fsspec.get_mapper('test.zarr')`) instead of one which uses `gcsfs`

using `dask` and `distributed` 2021.02.0, to see if I could attempt to reproduce this long computation start time. However, locally I saw the task stream become active ~30 seconds after I kicked off the compute with `ds.to_zarr(mapper)`. This makes me think that the slowdown isn't due purely to the scheduler processing a large graph.
A couple of things immediately come to mind:

1. differences between the local and cloud storage targets (i.e. the `mapper`)
2. transmitting the large graph from the `Client` to the scheduler is a bottleneck

To test whether or not transmitting the graph to the scheduler is a large issue, could you try turning off low-level task fusion? Instead of
```python
ds.to_zarr(mapper)
```

do

```python
with dask.config.set({"optimization.fuse.active": False}):
    ds.to_zarr(mapper)
```

which should hopefully result in a much smaller graph getting sent over the wire to the scheduler.
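A rough way to see what this config flag changes is to optimize the same collection with fusion on and off and compare the materialized task counts (a sketch; the array and its shape are illustrative, not the original workload):

```python
import dask
import dask.array as da

x = da.random.random((2_000, 2_000), chunks=(100, 100))
y = (x + 1).mean(axis=0)

# optimize the same collection with low-level fusion on and off
with dask.config.set({"optimization.fuse.active": True}):
    (fused,) = dask.optimize(y)
with dask.config.set({"optimization.fuse.active": False}):
    (unfused,) = dask.optimize(y)

print("fused tasks:", len(fused.__dask_graph__()))
print("unfused tasks:", len(unfused.__dask_graph__()))
```

Note the distinction: fusion reduces the number of materialized tasks, but materializing them happens on the client; with fusion off, the graph can be transmitted as compact `HighLevelGraph` layers, which is what can make the wire payload to the scheduler smaller.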
Additionally, I see there have been recent releases of `fsspec` and `gcsfs`. Just for completeness, what versions of these packages are you using?
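One quick way to collect those versions (a sketch; the package list here, including `zarr`, is my guess at the relevant stack):

```python
import importlib

versions = {}
for name in ["dask", "distributed", "fsspec", "gcsfs", "zarr"]:
    try:
        versions[name] = importlib.import_module(name).__version__
    except ImportError:
        versions[name] = None  # not installed in this environment
print(versions)
```

With a running cluster, `client.get_versions(check=True)` from `distributed` additionally compares client, scheduler, and worker versions.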
Stepping back a bit, I suspect that there will be times in the future when you encounter issues when running on pangeo resources and it will be useful for others to try and reproduce them. Does Pangeo have any publicly accessible resources we could use to try and reproduce the issues you run into? I know there's Pangeo cloud and Pangeo's binderhub, but I don't have a good sense for if these are appropriate for this use case
Also cc @ian-r-rose
> Stepping back a bit, I suspect that there will be times in the future when you encounter issues when running on pangeo resources and it will be useful for others to try and reproduce them. Does Pangeo have any publicly accessible resources we could use to try and reproduce the issues you run into? I know there's Pangeo cloud and Pangeo's binderhub, but I don't have a good sense for if these are appropriate for this use case
I did run these on the Pangeo Cloud. It only requires a sign-up. This would be a good place for all of us to be on the same playing field?
Thank you very much for the suggestions. Will try them now.
> I did run these on the Pangeo Cloud. It only requires a sign up.
Great! I'll sign up now. Time to dust off my old ORCID...
Using

```python
with dask.config.set({"optimization.fuse.active": False}):
    ds.to_zarr(mapper)
```

indeed cut the wait time down from ~4 min to less than 1 min! I'll try that in my full-blown workflow to see if it has a similar effect.
I'm curious about the tradeoffs of bypassing optimizations. They might make the computation start faster...but will it run slower?
I didn't run them to completion, but will now 😁
> They might make the computation start faster...but will it run slower?
This is a great question to ask! In general things will be slower. Specifically, [here are all the array optimizations that are skipped](https://github.com/dask/dask/blob/8663c6b7813fbdcaaa85d4fdde04ff42b1bb6ed0/dask/array/optimization.py#L53-L76) when `"optimization.fuse.active"` is turned off. Exactly how much slower things are depends on the particular computation, though I suspect the last optimization, `optimize_slices`, is particularly useful for common Xarray workloads.
Either moving these optimizations to be at the `HighLevelGraph` level (similar to the `cull` optimization [here](https://github.com/dask/dask/blob/8663c6b7813fbdcaaa85d4fdde04ff42b1bb6ed0/dask/highlevelgraph.py#L801)), or removing the need for a particular optimization altogether with improvements in the distributed scheduler, are part of the ongoing scheduler performance improvement effort (Matt gave a [recent talk](https://www.youtube.com/watch?v=vZ3R1DxTwbA&t) on this topic and here's a [blog post](https://blog.dask.org/2020/07/21/faster-scheduling) which outlines the main parts of these efforts). Ultimately we want to remove the need for the `"optimization.fuse.active"` config option, but we're not there yet.
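As a toy illustration of why the slicing-related optimizations matter (a sketch of the general idea, not of `optimize_slices` itself): after optimization, the graph for a small slice should retain only the tasks that slice actually needs.

```python
import dask
import dask.array as da

x = da.ones((1_000, 1_000), chunks=(100, 100))  # 100 chunks
y = x[:100, :100]                               # touches a single chunk

# applies culling/fusion/slice optimizations to the graph
(opt,) = dask.optimize(y)
print(len(y.__dask_graph__()), "->", len(opt.__dask_graph__()))
```

The optimized graph no longer carries tasks for chunks the slice never reads, which is exactly the behavior that matters when slicing into large on-disk arrays.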
I was mostly interested in turning off `"optimization.fuse.active"` to get a sense for how much of a bottleneck graph transmission from the client to the scheduler is or isn't.
The biggest thing that you'll miss from losing fusion is probably slicing fusion. Dask/Xarray co-evolved a lot of logic to allow slicing on HDF5/NetCDF files to only read in what was necessary. If you're doing full-volume or at least full-chunk data processing then I don't think that you're likely to miss much.
I'm curious, does this problem still persist when turning off fusion? If there is something else going on here then I'd like to get to the bottom of it. If not then I would encourage this group to start operating without fusion (I think that you'll be ok) and we can work towards making that the default on our end.
I haven't had time to get back to those test cases yet. In other workflows I have not really noticed this anymore, but I'll try to confirm soonish (backed up by paper revisions this/next week).
Hi everyone,
here is a relatively recent issue that puzzles me (and prevents me from upgrading to the latest dask/distributed versions).
For large computations, it can take very long until any computation is "started", as judged from nothing happening in the task stream/ProgressBar (for the threaded scheduler).
This example (which mimics part of my typical workload with high-resolution ocean model output) has been showing nothing for several minutes now (about 8-10 minutes at the time of writing).
Then I set up an adaptive dask gateway cluster.
I am then trying to write this to the pangeo scratch bucket.
I am running this on the pangeo google deployment with a "Large" server.
My versions are: dask 2021.01.1, distributed 2021.01.1
I realize that these datasets are quite large, but they are by no means unrealistic for modern climate/earth system models.
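A hypothetical stand-in for the kind of dataset described (the real example's shapes, variables, and chunking are not reproduced here; everything below is illustrative):

```python
import dask.array as da

# e.g. a (time, y, x) field with one time step per chunk -- scaled
# down here; real model output would be orders of magnitude larger
sst = da.random.random((50, 400, 400), chunks=(1, 400, 400))
print(f"{sst.nbytes / 1e6:.0f} MB in {sst.npartitions} chunks")
```

The key property is the large number of small chunks, which is what drives the task-graph size discussed in this thread.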
I originally noticed this behavior in one of my research projects when I upgraded from version `2020.12.0` to the latest release (I believe `2021.02.x`), and it led me to manually downgrade to get my workflow running, since nothing would happen even after I waited for 30+ minutes.