jrbourbeau opened this issue 3 years ago
Awesome thanks James! Really interested to see how this turns out 😄
Yeah, I'm also super curious to see if things change, and if so, what they are
Would add that we have a custom Dask config, which we supply by command line. If one wants to generate the call graphs, would make sure to copy our custom sitecustomize.py into the environment and use gprof2dot and dot (from GraphViz) to generate images of the call graphs.
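For anyone else generating these locally, a rough sketch of that post-processing step in Python (this assumes the profiler writes a pstats dump; the file names here are placeholders, not what sitecustomize.py actually produces):

# callgraph_render.py -- hypothetical helper, not part of the repo.
# Converts a cProfile/pstats dump into a call-graph image using gprof2dot + GraphViz dot.
import subprocess

def render_callgraph(pstats_path="scheduler.pstats", out_path="scheduler-callgraph.png"):
    # gprof2dot reads the pstats file and emits DOT source on stdout
    dot_source = subprocess.run(
        ["gprof2dot", "-f", "pstats", pstats_path],
        check=True, capture_output=True,
    ).stdout
    # dot renders that DOT source to a PNG
    subprocess.run(["dot", "-Tpng", "-o", out_path], input=dot_source, check=True)

if __name__ == "__main__":
    render_callgraph()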
Ok, I've gotten these running against Coiled (#130). Not integrated with run-nightly-benchmark.sh, etc., yet because I'm not sure how the automation should work, but it seems to work as a proof of concept.
Performance reports from a run this morning:
You'll notice that anom-mean is about twice as slow on Coiled. I'm not sure yet how much this is actually because of the network latencies, or just because the cluster has about half as much memory on Coiled as on your machine, and seems to be close to running out (you can see some spilling to disk). The Coiled Fargate backend only lets us go up to 32G per worker, and I wanted to keep it at 10 workers for comparability, but I imagine there are a few options for getting higher-memory workers if we want them (@jrbourbeau thoughts?).
Also note that every compute result had to go over the slow wire back down to my laptop, and today seemed to be a particularly slow internet day https://github.com/quasiben/dask-scheduler-performance/pull/130#issuecomment-801409734, so there's some extra latency there (probably not that significant for the results though). When fully automated, maybe we'd want nightly-run.py to run as a Coiled Job to remove the Internet from the equation?
(not that helpful, but at least you can see the error bar of the durations)
Thanks Gabe! 😄
The "Scheduler Profile (administrative)" tab for shuffle
appears to be blank. Is that expected? Would it be possible to include this information? 🙂
Thanks for catching that. It's also missing from my local copies. In fact, all besides anom-mean seem to be missing profiles. I haven't seen that before; any idea how it could happen? Either way, I'll rerun them and post new ones.
I'm not sure. Rerunning sounds like a good idea. Thanks for looking into this 🙂
Update: I've rerun and still am not getting any profiles. I've tried both rerunning on Coiled and running locally, including with 2021.3.0 and 2021.2.0 of dask+distributed. A really simple example in a new script does give profiles, though, so I'm trying to find a reproducer. (James, this appears to be connected to the thing I was seeing with worker system stats not updating.)
@jakirkham could the 10sec profiling interval be too long? https://github.com/quasiben/dask-scheduler-performance/blob/74d4eb7080b351f912a284a9eadb80726858f0e4/dask.yaml#L8-L10
Certainly for the fast-running ones, it makes sense that they'd miss the profile window entirely. But I don't understand why shuffle (which ran for 288sec) wouldn't have a profile either. (Related: https://github.com/dask/distributed/pull/4601)
I also don't see profiles when running main locally with DASK_CONFIG=dask.yaml; without the config set, all the profiles work. So the config seems to be the culprit.
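As a quick sanity check while bisecting this, something like the following shows which profiling values are actually in effect (a sketch; assumes the interval/cycle keys under distributed.worker.profile are what dask.yaml overrides):

# Sketch: print the resolved profiling config; run once plainly and once with
# DASK_CONFIG=dask.yaml set to see what the override changes.
import dask

print("interval:", dask.config.get("distributed.worker.profile.interval"))
print("cycle:   ", dask.config.get("distributed.worker.profile.cycle"))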
Update: you're applying dask.yaml after nightly-run.py, just for run-shuffle.sh. I didn't read that script carefully enough the first time around 🤦. So for consistency, I guess we don't actually want any config overrides for now? (Until we work on getting the call graphs out of run-shuffle.)
I'm rerunning on Coiled right now without the profile config changes, so should have actual profiles in a bit.
Ok with profiles, on Coiled:
simple
shuffle
rand
(skipping anom-mean since it's slow, and already has a profile here: https://github.com/quasiben/dask-scheduler-performance/issues/126#issuecomment-801425675)
Thanks for all of the debugging here! 😄
Yeah I think we added the profile value to cut down the amount of time spent in profiling itself, which shows up in other places as well. Though agree if it doesn't make sense we should change it.
That said, we may add a new benchmark soon. After some pair profiling with @quasiben today, we concluded the current profile needs some updates to better capture the problem we want to optimize
Ah that's very interesting about when the config is applied (only for one of the benchmarks?). I don't think that is actually intended and may explain a bug we've been trying to track down. Would be good to apply the config for both runs. So thanks for spotting that and raising 🙂
Also thanks for the new profiles. Will take a look
Yeah, I encourage us to use significantly faster profile intervals now. Typically this is 10ms rather than 10s. By setting it to 10s we pretty much turned it off. I recommend that we turn it back on now and aim to get at least 1000 samples of a computation, so in the 100ms mark. I think that setting it to 10ms is fine too though.
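For reference, a minimal sketch of what that would look like if set programmatically instead of in dask.yaml (the 10ms/1s values are illustrative, not a decision):

# Sketch: re-enable frequent sampling (applies to schedulers/workers started afterwards).
import dask

dask.config.set({
    "distributed.worker.profile.interval": "10ms",  # time between stack samples
    "distributed.worker.profile.cycle": "1s",       # how often samples are rolled up
})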
Thank you for getting this together @gjoseph92! Can you also get a baseline for how this performed with dask/distributed at 2.30? This is just before we started merging in scheduler changes. Note that when going back in time you need fuse optimizations turned on (also the --with-cython flag needs to be dropped on 2.30).
@quasiben will do. Re: the config, do we want to run under uvloop or not? These benchmarks did not use uvloop, since I was replicating run-dgx-benchmark.sh, which doesn't activate the config until after the nightly benchmarks have run. But these did use uvloop, since I didn't realize the config situation back then. For the 2.30 ones I'm going to continue not using uvloop to maintain a fair comparison, but we should figure out which we want to go with for the future.
Also, same goes for optimization.fuse.active: I believe that currently, it's still True for the nightly benchmarks?
Yeah, uvloop is one of the improvements. So we do want to use it on the latest releases, but not for 2.30.0.
Edit: Yep, we want fuse as True on the 2.30.0 builds (making it False is also one of the improvements)
Also, same goes for optimization.fuse.active: I believe that currently, it's still True for the nightly benchmarks?
Ah, yeah, we want optimization.fuse.active to be False for the latest benchmarks, but not for the 2.30.0 comparison point where we started all these improvements
Okay, got it. To be clear though, I believe that in the current benchmarks being posted every night, the performance reports and benchmark history plot aren't using optimization.fuse.active=False or uvloop. The call graphs, however, are. Just keep that in mind when comparing.
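To make the two configurations easy to reproduce, here's a minimal sketch of the "improved" settings discussed above, applied in code rather than via dask.yaml (standard dask/distributed config keys; only takes effect for clusters started after it's set):

# Sketch: the post-2.30.0 configuration (task fusion off, uvloop event loop).
import dask

dask.config.set({
    "optimization.fuse.active": False,          # disable task fusion during graph optimization
    "distributed.admin.event-loop": "uvloop",   # run scheduler/worker event loops on uvloop
})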
Results from dask/distributed 2.30.0 on Coiled:
Results from dask/distributed 2021.03.0+18.g98148cbf on Coiled, with optimization.fuse.active: false, event-loop: uvloop:
For both of these, the script was running on AWS via Coiled as well (no more slow wire to New Mexico).
So it seems that everything has a modest speedup from 2.30.0... except anom-mean, which is 2x slower on main. Very anecdotally, from occasional glances at the dashboards it seemed that RSS stayed much lower for 2.30.0 compared to main during that test; maybe 1-2 workers would be spilling, versus all of the bytes-stored bars being yellow on main.
Thanks Gabe! 😄
Just to confirm: the latest ones, 2021.03.0+18.g98148cbf, are Cythonized as well, and 2.30.0 is not?
Correct to both!
Interesting, looking at shuffle it seems like the Scheduler spends nearly half its time communicating, but the other half seems largely unaccounted for. Am curious what is going on in that other half. Did anything stick out to you while it was running?
@jakirkham I didn't notice anything besides what I mentioned about the memory, but I wasn't looking carefully. Agreed that this is weird, though. As an uneducated guess, could it have anything to do with Cythonization, and that part not being profiled? Though it would seem like an unreasonable amount of time to spend in there. We could try with profile.low-level=True?
Yeah, if --cython=profile is passed, we enable the "profile" directive and pass this to any Cython extension built. This allows Python profiling tools (like cProfile) to track the C function calls Cython generates, by adding a little bit of code in each function.
That said, agree it is possible that Dask profiling is filtering out functions here. Agree it might be worth playing with Dask profiling configuration to see if we can get more insight into this chunk of time
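If we do want to poke at that, a minimal sketch of the low-level profiling knob mentioned above (it is a real distributed config option, though it's off by default and may need optional dependencies):

# Sketch: also sample C-level frames, to complement --cython=profile.
import dask

dask.config.set({"distributed.worker.profile.low-level": True})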
Small update: using a different backend, I've gotten the benchmarks running on a Coiled cluster with higher-memory workers, giving us a comparable amount of memory to your machine (580GB for us, 540GB for you; before this we were getting 320GB). The good news is that this brings performance on Coiled in line with the nightly benchmarks (a tiny bit faster for all but simple, in fact). So as we expected, it was memory pressure and spilling to disk that was slowing down the anom-mean case before.
The bad news is that I still haven't figured out what's happening with the missing time in the profiles. I'm seeing this in both scheduler and worker administrative profiles, and it happens when I run locally as well, so I'll continue trying to track this down.
The good news is that this brings performance on Coiled in line with the nightly benchmarks
Thanks for the update 🎉
The bad news is that I still haven't figured out what's happening with the missing time in the profiles
What version of bokeh are you using? Or, better yet, could you paste the output of conda list for your local environment?
Ah, I realized you're talking about a portion of the profile plots being missing (like in https://github.com/quasiben/dask-scheduler-performance/issues/126#issuecomment-802608125), not that the profile plot is missing altogether.
I've figured out that the "missing" parts of the administrative profiles are caused by uvloop. Still trying to figure out why that is, though. Here's a reproducer:
# uvloop-profiling-repro.py
import time

import dask
import dask.array as da
import distributed
import xarray as xr
import numpy as np

if __name__ == "__main__":
    event_loop = dask.config.get('distributed.admin.event-loop')
    print(f"event loop: {event_loop}")

    client = distributed.Client()

    # Same workload as the anom-mean benchmark: grouped anomaly over a random array.
    # (Note: `da` is rebound from the dask.array module to the DataArray below;
    # harmless here since the module isn't used again afterwards.)
    data = da.random.random((10000, 10000), chunks=(1, 10000))
    da = xr.DataArray(
        data,
        dims=['time', 'x'],
        coords={'day': ('time', np.arange(data.shape[0]) % 10)},
    )
    clim = da.groupby('day').mean(dim='time')
    anom = da.groupby('day') - clim
    anom_mean = anom.mean(dim='time')

    with distributed.performance_report(filename=f"report-{event_loop}.html"):
        start = time.perf_counter()
        anom_mean.compute()
        elapsed = time.perf_counter() - start
        print(f"{elapsed:.2f} sec")
$ DASK_DISTRIBUTED__ADMIN__EVENT_LOOP=uvloop python uvloop-profiling-repro.py
event loop: uvloop
20.04 sec
$ python uvloop-profiling-repro.py
event loop: tornado
21.32 sec
produces report-uvloop.html, report-tornado.html, which show the same pattern we're seeing. I think it's not necessarily that there's "missing" time with uvloop, but rather that the total runtime captured (the base loop.run_forever() sort of thing) is much longer with uvloop (14.6sec vs 4.41sec with tornado). If you manually add up the runtimes of the first "interesting" row in the flame chart (the first with more than one element in it), you'll see that handler_func + stream.write + comm.read() adds up to ~3.3sec with uvloop; the equivalent with tornado is ~4.4sec, which seems about the same. So some questions to me are:
1) Why is this behavior different with uvloop?
2) Which, if any, are right? For something that took ~20sec to run, why is loop.run_forever() either 4sec or 16sec?
I haven't read through the profiling code much yet, so these answers may be obvious to someone who's familiar already. I could also move this to an issue on distributed if others think it is indeed a bug.
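One more way to dig in, rather than squinting at the HTML reports: Client.profile can pull the raw administrative profile data that the flame chart is drawn from. A sketch (assuming the returned dict carries the profile module's usual top-level sample count):

# Sketch: fetch raw administrative profiles from a running cluster for comparison.
import distributed

client = distributed.Client()
sched_admin = client.profile(scheduler=True)  # scheduler's administrative (event-loop) profile
worker_admin = client.profile(server=True)    # workers' administrative profiles, merged

print("scheduler event-loop samples:", sched_admin.get("count"))
print("worker event-loop samples:   ", worker_admin.get("count"))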
Interesting. Thanks for the info. So Cythonizing the Scheduler and using Tornado doesn't reproduce the issue? Just want to confirm Cythonization is not part of the problem (or identify if it is)
Great question @jakirkham, and I should have checked that first.
So Cythonizing the Scheduler and using Tornado doesn't reproduce the issue?
The reports I just posted were with the Cythonized scheduler, I should have mentioned that.
It seems not using Cython doesn't change things:
2021.3.0, scheduler in pure Python: report-uvloop.html, report-tornado.html
However, this appears to be a regression, so it still could be related to all the Cythonization work:
2.30.0: report-uvloop.html, report-None.html
Well, in 2.30 I don't think uvloop was an option. So configuring it to use uvloop wouldn't have done anything. IOW Tornado would have still been used.
Yeah, that's a good point -- the uvloop config option was added in 2021.01.1 (xref https://github.com/dask/distributed/pull/4448)
So it sounds like this is a uvloop thing then
Yup, that's what it seems like to me. I think this is why: https://github.com/dask/distributed/issues/4636
We've been wanting to run the nightly benchmarks on a cluster that's spread across multiple machines. Using our Dockerfile we can create a software environment and cluster on Coiled to run the benchmarks across a large multi-node cluster.
As a first pass at this one could:
1. Update the Dockerfile to make sure that the libraries needed to run the benchmarks are installed (e.g. xarray, matplotlib, etc.)
2. Save the performance_reports which the benchmark script generates here so we can compare them to the existing performance reports which run on a single machine.
cc @gjoseph92 this may be something that interests you as it both involves using Coiled and is related to the ongoing scheduler performance improvement effort
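As a rough sketch of the Coiled setup described above (API as I understand the coiled client; the environment name, container image, and sizes below are placeholders, not values from this repo):

# Sketch: build a Coiled software environment from our Docker image and start a multi-node cluster.
import coiled
import distributed

coiled.create_software_environment(
    name="scheduler-benchmarks",                        # placeholder environment name
    container="<org>/<image-from-our-Dockerfile>:tag",  # placeholder image reference
)

cluster = coiled.Cluster(
    software="scheduler-benchmarks",
    n_workers=10,              # mirror the 10-worker nightly setup
    worker_memory="32 GiB",    # per-worker memory; adjust to match the DGX comparison
)
client = distributed.Client(cluster)
# ...run the benchmark script against `client` and save the performance_report HTML files...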