pangeo-data / pangeo-tutorial-gallery

Repo to house pangeo-tutorial notebooks for pangeo-gallery

Suggestion: use `n_workers=2` when starting Dask cluster #12

Open paigem opened 3 years ago

paigem commented 3 years ago

Currently, in the dask.ipynb tutorial, the cluster is spun up with only one worker. This doesn't raise any errors, but I think it is suboptimal for demonstrating the dask.delayed functionality near the end of the notebook. With only one worker, the following two blocks of code run in the same amount of time (~400ms):

import time
import dask

# Define functions
def inc(x):
    time.sleep(0.1)
    return x + 1

def dec(x):
    time.sleep(0.1)
    return x - 1

def add(x, y):
    time.sleep(0.2)
    return x + y 

# Run without using Dask
x = inc(1)
y = dec(2)
z = add(x, y)
z

# Run using Dask
inc = dask.delayed(inc)
dec = dask.delayed(dec)
add = dask.delayed(add)

x = inc(1)
y = dec(2)
z = add(x, y)
z.compute()

With (at least) two workers, the second block runs in ~300ms, because x and y can be computed in parallel (~100ms) before add runs (~200ms). This parallelism is visible on the Dask dashboard and shows the user why the example takes less time.
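To check the timing claim outside the notebook, here is a minimal sketch that runs the same small graph on a fresh one-worker cluster and then on a two-worker cluster. The helper `timed_run` is hypothetical, and `threads_per_worker=1`, `processes=False`, and `dashboard_address=None` are my assumptions to keep the script self-contained; the tutorial's actual cluster settings may differ, and a single worker with several threads could already run x and y in parallel.

```python
import time

import dask
from dask.distributed import Client, LocalCluster


def inc(x):
    time.sleep(0.1)
    return x + 1


def dec(x):
    time.sleep(0.1)
    return x - 1


def add(x, y):
    time.sleep(0.2)
    return x + y


def timed_run(n_workers):
    """Hypothetical helper: time the graph on a fresh cluster of single-threaded workers."""
    # Assumptions: in-process workers and no dashboard, so this runs as a plain script.
    with LocalCluster(
        n_workers=n_workers,
        threads_per_worker=1,
        processes=False,
        dashboard_address=None,
    ) as cluster:
        with Client(cluster):
            # Build the lazy graph: add depends on both inc and dec.
            z = dask.delayed(add)(dask.delayed(inc)(1), dask.delayed(dec)(2))
            start = time.perf_counter()
            value = z.compute()  # uses the active distributed client
            elapsed = time.perf_counter() - start
    return value, elapsed


results = {n: timed_run(n) for n in (1, 2)}
for n, (value, elapsed) in results.items():
    print(f"n_workers={n}: result={value}, elapsed={elapsed:.2f}s")
```

With one single-threaded worker the three sleeps run back to back, so the run cannot finish in under about 0.4s. With two workers the lower bound drops to about 0.3s, the length of the critical path (one 0.1s task followed by add). Measured times will sit somewhat above these bounds because of scheduling overhead.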

So I suggest adding n_workers=2 to the following lines of code near the beginning of the tutorial:

from dask.distributed import Client, LocalCluster
cluster = LocalCluster(n_workers=2)

Alternatively, we could spin up a new cluster with 2 workers specifically for the dask.delayed portion of the tutorial. Any preferences from folks, e.g. @rabernat?
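For the alternative, a rough sketch of what swapping clusters before the dask.delayed section might look like. The first `cluster` and `client` are hypothetical stand-ins for whatever the notebook created earlier, and the names `delayed_cluster` and `delayed_client`, as well as the `threads_per_worker=1`, `processes=False`, and `dashboard_address=None` settings, are assumptions for a self-contained script rather than the tutorial's configuration.

```python
from dask.distributed import Client, LocalCluster

# Hypothetical stand-ins for the single-worker cluster created earlier in the notebook.
cluster = LocalCluster(
    n_workers=1, threads_per_worker=1, processes=False, dashboard_address=None
)
client = Client(cluster)

# Shut down the single-worker setup before the dask.delayed section.
client.close()
cluster.close()

# Start a two-worker cluster used only for the dask.delayed section.
delayed_cluster = LocalCluster(
    n_workers=2, threads_per_worker=1, processes=False, dashboard_address=None
)
delayed_client = Client(delayed_cluster)  # becomes the default for .compute()
delayed_client.wait_for_workers(2)

worker_count = len(delayed_cluster.workers)
print(f"workers available for dask.delayed: {worker_count}")

# Clean up once the section is finished.
delayed_client.close()
delayed_cluster.close()
```

In a notebook you would likely drop `dashboard_address=None` so the dashboard stays available for watching x and y run side by side. The trade-off against simply passing n_workers=2 up front is one extra teardown and startup step in the middle of the tutorial.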

Again, this was noticed by @NickMortimer during a workshop at the Dask Summit this week. 🙂