quasiben / dask-scheduler-performance

BSD 3-Clause "New" or "Revised" License

Run benchmarks on Coiled #126

Open jrbourbeau opened 3 years ago

jrbourbeau commented 3 years ago

We've been wanting to run the nightly benchmarks on a cluster that's spread across multiple machines. Using our Dockerfile we can create a software environment and cluster on Coiled to run the benchmarks across a large multi-node cluster.

As a first pass at this one could:

  1. Add a few updates to our existing Dockerfile to make sure that the libraries needed to run the benchmarks are installed (e.g. xarray, matplotlib, etc.)
  2. Create a container-only software environment on Coiled which uses our custom docker image
  3. Spin up a cluster on Coiled which uses this container-only software environment
  4. Run our nightly benchmarks on the Coiled cluster
  5. Post the corresponding performance_reports which the benchmark script generates here, so we can compare them to the existing performance reports, which run on a single machine. (A rough sketch of what steps 2-4 might look like with the Coiled Python API follows this list.)
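
For reference, a minimal sketch of steps 2-4 using Coiled's Python API. The software environment name, image tag, and cluster size here are placeholders, not what the repo actually uses.

```python
# Hypothetical sketch of steps 2-4: a container-only Coiled software environment
# built from the repo's Docker image, and a multi-node cluster that uses it.
import coiled
from dask.distributed import Client

coiled.create_software_environment(
    name="dask-scheduler-benchmarks",                      # placeholder name
    container="<dockerhub-org>/<benchmark-image>:latest",  # placeholder image tag
)

cluster = coiled.Cluster(
    software="dask-scheduler-benchmarks",
    n_workers=10,
)
client = Client(cluster)
# Step 4: point the nightly benchmark script at `client` and collect the
# performance_report HTML files it writes.
```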

cc @gjoseph92 this may be something that interests you as it both involves using Coiled and is related to the ongoing scheduler performance improvement effort

jakirkham commented 3 years ago

Awesome thanks James! Really interested to see how this turns out 😄

mrocklin commented 3 years ago

Yeah, I'm also super curious to see whether things change, and if so, what those changes are.


jakirkham commented 3 years ago

Would add that we have a custom Dask config, which we supply by command line.

If one wants to generate the callgraphs, make sure to copy our custom sitecustomize.py into the environment, then use gprof2dot and dot (from GraphViz) to render them as images.
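
As a rough sketch of that callgraph step (the stats file name is a placeholder; this assumes the sitecustomize.py hook writes a cProfile/pstats dump, and that gprof2dot and GraphViz's dot are on PATH):

```python
# Convert a pstats profile dump into a callgraph image by piping it through
# gprof2dot and GraphViz's `dot`. "scheduler.pstats" is a placeholder file name.
import subprocess

subprocess.run(
    "gprof2dot -f pstats scheduler.pstats | dot -Tpng -o scheduler-callgraph.png",
    shell=True,
    check=True,
)
```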

gjoseph92 commented 3 years ago

Ok, I've gotten these running against Coiled (#130). Not integrated with run-nightly-benchmark.sh, etc., yet because I'm not sure how the automation should work, but it seems to work as a proof of concept.

Performance reports from a run this morning:

You'll notice that anom-mean is about twice as slow on Coiled. I'm not sure yet how much this is actually because of the network latencies, or just because the cluster has about half as much memory on Coiled as on your machine, and seems to be close to running out (you can see some spilling to disk). The Coiled Fargate backend only lets us go up to 32G per worker, and I wanted to keep it at 10 workers for comparability, but I imagine there are a few options for getting higher-memory workers if we want them (@jrbourbeau thoughts?).

Also note that every compute result had to go over the slow wire back down to my laptop, and today seemed to be a particularly slow internet day https://github.com/quasiben/dask-scheduler-performance/pull/130#issuecomment-801409734, so there's some extra latency there (probably not that significant for the results though). When fully automated, maybe we'd want nightly-run.py to run as a Coiled Job to remove the Internet from the equation?

20210317-benchmark-history-coiled (not that helpful, but at least you can see the error bar of the durations)

Output

```
simple       1.291e+00 +/- 2.601e-01
shuffle      2.794e+01 +/- 3.854e+00
rand_access  2.467e-01 +/- 9.742e-02
anom_mean    3.704e+02 +/- 9.126e+01

## Raw Values
simple      [1.75858582  1.50355149  1.74379997  1.06905734  1.12978183
             1.06219367  1.16227288  1.25264611  1.09449319  1.13785423]
shuffle     [23.27284773 22.92598115 26.89265755 31.65077646 26.76775415
             25.47660678 28.02095303 35.77000727 31.96979398 26.62711623]
rand_access [0.225034    0.22522116  0.23306179  0.22801113  0.23228598
             0.10311818  0.23311377  0.23831201  0.51537108  0.233042  ]
anom_mean   [216.58301497 284.30850673 284.672225   337.90460682 355.02117991
             362.10879326 504.11227417 508.33712792 409.31720185 441.69501901]
```

jakirkham commented 3 years ago

Thanks Gabe! 😄

The "Scheduler Profile (administrative)" tab for shuffle appears to be blank. Is that expected? Would it be possible to include this information? 🙂

gjoseph92 commented 3 years ago

Thanks for catching that. It's also missing from my local copies. In fact, all besides anom-mean seem to be missing profiles. I haven't seen that before; any idea how it could happen? Either way I'll rerun them and post new ones.

jakirkham commented 3 years ago

I'm not sure. Rerunning sounds like a good idea. Thanks for looking into 🙂

gjoseph92 commented 3 years ago

Update: I've rerun and still am not getting any profiles. I've tried both rerunning on Coiled and running locally, including with 2021.3.0 and 2021.2.0 of dask+distributed. A really simple example in a new script does give profiles though, so I'm trying to find a reproducer. (James, this appears to be connected to the thing I was seeing with worker system stats not updating.)

gjoseph92 commented 3 years ago

@jakirkham could the 10sec profiling interval be too long? https://github.com/quasiben/dask-scheduler-performance/blob/74d4eb7080b351f912a284a9eadb80726858f0e4/dask.yaml#L8-L10

Certainly for the fast-running ones, it makes sense that they'd miss the profile window entirely. But I don't understand why shuffle (which ran for 288sec) wouldn't have a profile either. (Related: https://github.com/dask/distributed/pull/4601)

I also don't see profiles when running main locally with DASK_CONFIG=dask.yaml; without config set, all the profiles work. So the config seems to be the culprit.

Update: you're applying dask.yaml after nightly-run.py, just for run-shuffle.sh. I didn't read that script carefully enough the first time around 🤦. So for consistency, I guess we don't actually want any config overrides for now? (Until we work on getting the call graphs out of run-shuffle.)

I'm rerunning on Coiled right now without the profile config changes, so should have actual profiles in a bit.
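
(As a sanity check on which overrides are actually in effect, something like the following could be printed at the start of a run. dask.config.get with a default is standard dask API; the keys shown are just the ones discussed in this thread.)

```python
# Sketch: print the effective config at benchmark start-up so it is obvious
# whether dask.yaml / DASK_CONFIG was picked up for this process.
import dask

for key in (
    "distributed.worker.profile.interval",
    "distributed.admin.event-loop",
    "optimization.fuse.active",
):
    print(key, "=", dask.config.get(key, default=None))
```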

gjoseph92 commented 3 years ago

Ok with profiles, on Coiled:

jakirkham commented 3 years ago

Thanks for all of the debugging here! 😄

Yeah, I think we added the profile value to cut down the amount of time spent in profiling itself, which shows up in other places as well. Though I agree that if it doesn't make sense we should change it.

That said, we may add a new benchmark soon. After some pair profiling with @quasiben today, we concluded the current profile needs some updates to better capture the problem we want to optimize

Ah, that's very interesting about when the config is applied (only for one of the benchmarks?). I don't think that is actually intended, and it may explain a bug we've been trying to track down. It would be good to apply the config for both runs. So thanks for spotting that and raising it 🙂

Also thanks for the new profiles. Will take a look

mrocklin commented 3 years ago

Yeah, I encourage us to use significantly faster profile intervals now. Typically this is 10ms rather than 10s; by setting it to 10s we pretty much turned it off. I recommend that we turn it back on now and aim to get at least 1000 samples of a computation, so something in the 100ms range. I think that setting it to 10ms is fine too, though.
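
For reference, the equivalent settings expressed with dask.config.set (the repo sets them via dask.yaml instead; the 10ms/1000ms values follow the recommendation above):

```python
# Sketch: sample the administrative profile every 10ms and aggregate every second,
# instead of the 10s interval that effectively disabled profiling.
import dask

dask.config.set({
    "distributed.worker.profile.interval": "10ms",  # time between samples
    "distributed.worker.profile.cycle": "1000ms",   # time between aggregated snapshots
})
```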

quasiben commented 3 years ago

Thank you for getting this together @gjoseph92! Can you also get a baseline for how this performed with dask/distributed 2.30.0? That is just before we started merging in scheduler changes. Note that when going back in time you need fuse optimizations turned on.

jakirkham commented 3 years ago

(also the --with-cython flag needs to be dropped on 2.30)

gjoseph92 commented 3 years ago

@quasiben will do. Re: the config, do we want to run under uvloop or not? These latest benchmarks did not use uvloop, since I was replicating run-dgx-benchmark.sh, which doesn't activate the config until after the nightly benchmarks have run. But the earlier ones did use uvloop, since I didn't realize the config situation back then.

For the 2.30 ones I'm going to continue not using uvloop to maintain a fair comparison, but we should figure out which we want to go with for the future.

Also, same goes for optimization.fuse.active: I believe that currently, it's still True for the nightly benchmarks?

jakirkham commented 3 years ago

Yeah uvloop is one of the improvements. So we do want to use it on the latest releases, but not for 2.30.0

Edit: Yep we want fuse as True on the 2.30.0 builds (making it False is also one of the improvements)

jrbourbeau commented 3 years ago

Also, same goes for optimization.fuse.active: I believe that currently, it's still True for the nightly benchmarks?

Ah, yeah we want optimization.fuse.active to be False for the latest benchmarks, but not for the 2.30.0 comparison point where we started all these improvements
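
For reference, the two switches being compared, expressed as a dask.config.set sketch (the nightly scripts set them through YAML/environment variables instead):

```python
import dask

# Latest (main) runs: linear-fusion optimization off, uvloop event loop on.
dask.config.set({
    "optimization.fuse.active": False,
    "distributed.admin.event-loop": "uvloop",
})

# 2.30.0 baseline runs: keep fusion on; the event-loop option does not exist there,
# so the default tornado loop is used.
dask.config.set({"optimization.fuse.active": True})
```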

gjoseph92 commented 3 years ago

Okay got it. To be clear though, I believe that in current benchmarks that are being posted every night, the performance reports and benchmark history plot aren't using optimization.fuse.active=False or uvloop. The call graphs, however, are. Just keep that in mind when comparing.


Results from dask/distributed 2.30.0 on Coiled:

Results from dask/distributed 2021.03.0+18.g98148cbf on Coiled, with optimization.fuse.active: false, event-loop: uvloop:

For both of these, the script was running on AWS via Coiled as well (no more slow wire to New Mexico).

So it seems that everything has a modest speedup from 2.30.0... except anom-mean, which is 2x slower on main. Very anecdotally, from occasional glances at the dashboards it seemed that RSS stayed much lower for 2.30.0 compared to main during that test; maybe 1-2 workers would be spilling, versus all of the bytes stored bars being yellow on main.

jakirkham commented 3 years ago

Thanks Gabe! 😄

Just to confirm: the latest ones (2021.03.0+18.g98148cbf) are Cythonized as well, and 2.30.0 is not?

gjoseph92 commented 3 years ago

Correct to both!

jakirkham commented 3 years ago

Interesting. Looking at shuffle, it seems like the Scheduler spends nearly half its time communicating, but the other half is largely unaccounted for. I'm curious what is going on in that other half. Did anything stick out to you while it was running?

[Screenshot: scheduler profile for shuffle, 2021-03-19]
gjoseph92 commented 3 years ago

@jakirkham I didn't notice anything besides what I mentioned about the memory, but I wasn't looking carefully. Agreed that this is weird, though. As an uneducated guess, could it have anything to do with Cythonization, and that part not being profiled? Though it would seem like an unreasonable amount of time to spend in there. We could try with profile.low-level=True?
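
For reference, a minimal sketch of the profile.low-level suggestion (this is distributed's standard config key; it also needs the optional stacktrace package installed):

```python
# Sketch: enable low-level profiling so compiled (e.g. Cython) frames can show up
# in the administrative profiles rather than being lumped into unaccounted time.
import dask

dask.config.set({"distributed.worker.profile.low-level": True})
```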

jakirkham commented 3 years ago

Yeah, if --cython=profile is passed, we enable the "profile" directive for any Cython extension built. This allows Python profiling tools (like cProfile) to track the C function calls Cython generates, by adding a little bit of instrumentation to each function.

That said, agree it is possible that Dask profiling is filtering out functions here. Agree it might be worth playing with Dask profiling configuration to see if we can get more insight into this chunk of time

gjoseph92 commented 3 years ago

Small update: using a different backend, I've gotten the benchmarks running on a Coiled cluster with higher-memory workers, giving us a comparable amount of memory to your machine (580GB for us, 540GB for you; before this we were getting 320GB). The good news is that this brings performance on Coiled in line with the nightly benchmarks (a tiny bit faster for all but simple, in fact). So as we expected, it was memory pressure and spilling to disk that was slowing down the anom-mean case before.

Output

The bad news is that I still haven't figured out what's happening with the missing time in the profiles. I'm seeing this in both scheduler and worker administrative profiles, and it happens when I run locally as well, so I'll continue trying to track this down.

jrbourbeau commented 3 years ago

The good news is that this brings performance on Coiled in line with the nightly benchmarks

Thanks for the update 🎉

The bad news is that I still haven't figured out what's happening with the missing time in the profiles

What version of bokeh are you using? Or, better yet, could you paste the output of conda list for your local environment?

jrbourbeau commented 3 years ago

Ah, I realized you're talking about a portion of the profile plots which are missing (like in https://github.com/quasiben/dask-scheduler-performance/issues/126#issuecomment-802608125) not that the profile plot is missing altogether.

gjoseph92 commented 3 years ago

I've figured out that the "missing" parts of the administrative profiles are caused by uvloop. Still trying to figure out why that is, though. Here's a reproducer:

```python
# uvloop-profiling-repro.py
import time

import dask
import dask.array as da
import distributed
import xarray as xr
import numpy as np

if __name__ == "__main__":
    event_loop = dask.config.get('distributed.admin.event-loop')
    print(f"event loop: {event_loop}")
    client = distributed.Client()

    data = da.random.random((10000, 10000), chunks=(1, 10000))
    da = xr.DataArray(data, dims=['time', 'x'],
                      coords={'day': ('time', np.arange(data.shape[0]) % 10)})
    clim = da.groupby('day').mean(dim='time')
    anom = da.groupby('day') - clim
    anom_mean = anom.mean(dim='time')
    with distributed.performance_report(filename=f"report-{event_loop}.html"):
        start = time.perf_counter()
        anom_mean.compute()
        elapsed = time.perf_counter() - start
        print(f"{elapsed:.2f} sec")
```

```
$ DASK_DISTRIBUTED__ADMIN__EVENT_LOOP=uvloop python uvloop-profiling-repro.py
event loop: uvloop
20.04 sec
$ python uvloop-profiling-repro.py
event loop: tornado
21.32 sec
```

This produces report-uvloop.html and report-tornado.html, which show the same pattern we're seeing. I think it's not necessarily that there's "missing" time with uvloop, but rather that the total runtime captured (the base loop.run_forever() sort of thing) is much longer with uvloop (14.6sec vs 4.41sec with tornado). If you manually add up the runtimes of the first "interesting" row in the flamechart (the first with more than one element in it), you'll see that handler_func + stream.write + comm.read() adds up to ~3.3sec with uvloop; the equivalent with tornado is ~4.4sec, which seems about the same. So some questions to me are: 1) Why is this behavior different with uvloop? 2) Which, if either, is right? For something that took ~20sec to run, why does loop.run_forever() account for either 4sec or 16sec?

I haven't read through the profiling code much yet, so these answers may be obvious to someone who's familiar already. I could also move this to an issue on distributed if others think it is indeed a bug.
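
One way to dig further would be to pull the raw administrative profile data that performance_report visualizes; a sketch, assuming distributed's Client.profile(scheduler=..., server=...) API:

```python
# Sketch: fetch the raw administrative profiles to compare uvloop vs. tornado
# numerically instead of eyeballing the flamecharts in the HTML reports.
import distributed

client = distributed.Client()

# Administrative profile of the scheduler's event loop.
scheduler_admin = client.profile(scheduler=True)

# Administrative profiles of the workers' event loops (not their task threads).
worker_admin = client.profile(server=True)

# Both are nested frame dictionaries; the top-level "count" is the number of samples
# attributed to loop.run_forever(), which is the quantity that differs between the
# uvloop and tornado reports above.
print(scheduler_admin.get("count"), worker_admin.get("count"))
```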

jakirkham commented 3 years ago

Interesting. Thanks for the info. So Cythonizing the Scheduler and using Tornado doesn't reproduce the issue? Just want to confirm Cythonization is not part of the problem (or identify if it is)

gjoseph92 commented 3 years ago

Great question @jakirkham, and I should have checked that first.

So Cythonizing the Scheduler and using Tornado doesn't reproduce the issue?

The reports I just posted were with the Cythonized scheduler, I should have mentioned that.

It seems not using Cython doesn't change things: 2021.3.0, scheduler in pure Python: report-uvloop.html, report-tornado.html

However, this appears to be a regression, so it still could be related to all the Cythonization work: 2.30.0: report-uvloop.html, report-None.html

jakirkham commented 3 years ago

Well in 2.30 I don't think uvloop was an option. So configuring it to use uvloop wouldn't have done anything. IOW Tornado would have still been used

jrbourbeau commented 3 years ago

Yeah, that's a good point -- the uvloop config option was added in 2021.01.1 (xref https://github.com/dask/distributed/pull/4448)

jakirkham commented 3 years ago

So it sounds like this is a uvloop thing then

gjoseph92 commented 3 years ago

Yup, that's what it seems like to me. I think this is why: https://github.com/dask/distributed/issues/4636