mnlevy1981 / pangeo-tools

A few scripts to make running pangeo on cheyenne a little easier

Dask worker load balance #1

Open vprzybylo opened 6 years ago

vprzybylo commented 6 years ago

Hi @mnlevy1981, I could use some help balancing workers and scaling up my submission so it runs under the 12 hour wall clock time. I have tried submitting the client-side pangeo script with both 1 and 2 nodes on the shared and economy queues and scaling to 15 workers using jupyter lab. In the ~/.config/dask/jobqueue.yaml file I have 36 cores, 9 processes, and 100 GB of memory on the economy queue.

After emailing CISL to look over my job configuration, here is one of the log files:

Job ID   User     Queue    Nodes  Submit   Start    Finish   Mem(GB)  CPU(%)  Wall(s)
2974611  przybyl  economy  1      14-1615  14-1648  15-0416  0.7      2.7     41267
2974612  przybyl  economy  1      14-1615  14-1648  15-0416  0.7      2.7     41268
2974606  przybyl  economy  1      14-1615  14-1648  15-0416  0.7      2.7     41268
2974616  przybyl  economy  1      14-1615  14-1648  15-0416  7.2      100.0   41268
2974607  przybyl  economy  1      14-1615  14-1648  15-0416  0.7      2.7     41268
2974615  przybyl  economy  1      14-1615  14-1648  15-0416  7.7      100.0   41268
2974618  przybyl  economy  1      14-1615  14-1657  15-0416  0.7      0.7     40750
2974619  przybyl  economy  1      14-1615  14-1657  15-0416  7.6      100.0   40750
2974620  przybyl  economy  1      14-1615  14-1657  15-0416  4.5      82.3    40749
2974613  przybyl  economy  1      14-1615  14-1648  15-0416  0.7      2.7     41267
2974610  przybyl  economy  1      14-1615  14-1648  15-0416  0.7      2.7     41267
2974614  przybyl  economy  1      14-1615  14-1648  15-0416  7.8      100.0   41267
2974617  przybyl  economy  1      14-1615  14-1648  15-0416  7.3      100.0   41267
2974608  przybyl  economy  1      14-1615  14-1648  15-0416  0.7      2.7     41267
2974609  przybyl  economy  1      14-1615  14-1648  15-0416  0.7      2.7     41267
2974604  przybyl  economy  2      14-1613  14-1614  15-0415  0.0      0.1     43273

Here there are sixteen jobs, so it seems that when I scale to 15 workers, dask is placing each worker on its own node instead of putting 9 per node as the dask config would seem to indicate. Or perhaps dask is putting more than one worker on the jobs where the CPU load is high and no workers on the jobs where the CPU load is almost 0%.
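
One way to check where dask actually placed the workers (a minimal sketch, assuming a connected client object; the node names it prints are just whatever hostnames the scheduler reports) is to count workers per host:

from collections import Counter

# each worker entry in scheduler_info() includes the host it is running on
workers = client.scheduler_info()["workers"]
print(Counter(w["host"] for w in workers.values()))   # number of dask workers per node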

Any advice on how to balance the workers and/or scale up this job in the most economical way possible is greatly appreciated!

mnlevy1981 commented 6 years ago

Which version of dask-jobqueue are you using? I recently updated NCAR/pangeo to use 0.4.0 instead of 0.3.0... in the old version,

cluster.scale(N)

requests N nodes, but in the newer version it requests N workers (N/9 nodes with the default jobqueue.yaml settings).
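
For example (a minimal sketch, assuming the default 36 cores / 9 processes per job from jobqueue.yaml):

from dask_jobqueue import PBSCluster
from dask.distributed import Client

cluster = PBSCluster()    # picks up ~/.config/dask/jobqueue.yaml
client = Client(cluster)

# 0.3.0: scale(15) requested 15 PBS jobs (nodes), i.e. 15 x 9 = 135 workers
# 0.4.0: scale(15) requests 15 workers, which fit in 2 PBS jobs at 9 workers each
cluster.scale(15)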

That said, I'm not terribly familiar with trouble-shooting dask, so maybe @jhamman would have an idea on what's going on?

vprzybylo commented 6 years ago

What is the best way to determine the version? I have tried quite a few commands but I can't seem to find the version via the command line. Since processes is set to 9 in my jobqueue.yaml file I would guess I am using 0.4.0, but I have not performed a git pull since about a month ago.

mnlevy1981 commented 6 years ago

If you activate your conda environment, you should be able to get the version number from pip:

$ source activate pangeo-cheyenne
(pangeo-cheyenne) $ pip list | grep jobqueue
dask-jobqueue        0.4.0
(pangeo-cheyenne) $
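
Another option, assuming the same environment is active, is to read the version from Python:

import dask_jobqueue
print(dask_jobqueue.__version__)   # e.g. '0.4.0'
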
vprzybylo commented 6 years ago

I forgot to activate the environment first since it is usually done in the script, whoops! It is actually version 0.3.0, not 0.4.0 as I assumed.

jhamman commented 6 years ago

I would first make sure you've updated to 0.4 so we're all speaking the same language.

I would then recommend working through the "How to debug" page on the jobqueue docs: https://dask-jobqueue.readthedocs.io/en/latest/debug.html#checking-job-script
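
In particular, it can be worth printing the job script that dask-jobqueue generates before anything is submitted (a minimal sketch, using the cluster arguments that appear later in this thread):

from dask_jobqueue import PBSCluster

cluster = PBSCluster(queue='economy', project='UALB0017', walltime='12:00:00')
print(cluster.job_script())   # shows the PBS directives and the dask-worker command that will run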

If you can report back with a) what your configuration options are, b) how you are instantiating the cluster object, and c) what your job script looks like, I think we'll have a better chance of helping you work through your issue.

mnlevy1981 commented 6 years ago

Thanks @jhamman!

@vprzybylo -- I think the easiest way to upgrade would be to do the following wherever you cloned NCAR/pangeo on cheyenne (comments after #):

$ git status # just make sure you haven't changed anything by hand
$ git pull # update to 39039c6
$ conda remove --name pangeo-cheyenne --all # delete existing environment
$ cd server-side/
$ conda env create -f pangeo-cheyenne.yaml

You should be able to run the same tests as in my previous comment to ensure you're up to 0.4.0 (and note that at that point you'll get 1/9th the number of nodes because of the change in cluster.scale()).

vprzybylo commented 6 years ago

Thanks, @jhamman and @mnlevy1981,

After creating the new environment for 0.4.0 I am running:

./launch-pangeo-on-cheyenne -u przybylo -A UALB0017 -e pangeo -w 12:00:00

The only thing that has changed is that I tried a run with the default settings of 1 node and 1 core to start the interactive lab (before I had it set to 36 cores, which is probably excessive for starting the notebook). Nothing has changed in the .yaml config file, partially copied below:

jobqueue:
  pbs:
    name: dask-worker

    # Dask worker options
    cores: 36                 # Total number of cores per job
    memory: '109 GB'          # Total amount of memory per job
    processes: 9              # Number of Python processes per job

...

In the notebook I then instantiate the cluster as:

cluster = PBSCluster(queue='economy', project='UALB0017', walltime='12:00:00')
client = Client(cluster)

And scale the cluster (now using 0.4.0, so 9x the workers) as:

cluster.scale(9 * 15)

After the workers come online, my main call looks like:

from dask import delayed
from dask.distributed import as_completed

# one delayed task per value of phio
output = [delayed(lab.parallelize_clusters)(phio=x, save_plots=save_plots, minor=minor,
                                            nclusters=nclusters, ncrystals=ncrystals,
                                            numaspectratios=numaspectratios, speedy=speedy,
                                            rand_orient=rand_orient, ch_dist=ch_dist)
          for x in phio]

results = client.compute(output)   # returns a list of futures

count = 0
with open('futures_complete.txt', 'w') as f3:
    for future, result in as_completed(results, with_results=True):
        print(future)
        print('completed: ', count)
        f3.write(str(count))
        f3.write("\n")
        count += 1

results = client.gather(results)
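
As an aside (not something discussed in this thread), distributed also has a progress helper that can watch the same set of futures while they run:

from dask.distributed import progress
progress(results)   # displays a progress bar over the submitted futures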

In the system account manager (SAM) I can see the job submissions, but I don't believe I have access to CPU usage outside of the dask dashboard, which I'm not 100% confident in since I have only been able to see the task colors for each function on a few past runs, and when I do it looks efficient with no gaps in color. Nevertheless, CPU usage is maxed out for only about a quarter of the workers. The log files posted above were from a CISL staff member whom I asked to look over my config before aimlessly using up core hours.

[screenshot: dask worker CPU usage, 2018-10-23 9:05 AM]

As far as debugging, I know there are no errors in my script since smaller jobs have completed and the print statements in my dask worker files have made it to the end. Also, I have been writing out the future statuses, and on previous smaller runs all 50 completed with no 'error' output. The job simply does not seem to get to the end of the more resource-intensive tasks before the 12 hour wall clock is exceeded, but I'm not sure about the scheduler overhead or whether there is a bottleneck.

The only error in the dask.o.. log file is:

distributed.diskutils - ERROR - Failed to remove '/gpfs/u/home/przybylo/IPAS_array/dask-worker-space/worker-7j_tz2d0' (failed in ): [Errno 2] No such file or directory: '/gpfs/u/home/przybylo/IPAS_array/dask-worker-space/worker-7j_tz2d0'

but that doesn't seem to be an issue, as the worker keeps going until 'nanny is closed.' My guess is that the worker space is not properly set up.
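
If the dask-worker-space errors turn out to matter, one thing to try (a hedged sketch; the scratch path below is only an assumption about cheyenne, not something confirmed in this thread) is pointing the workers' local directory away from the GPFS home directory via the local_directory option:

# '/glade/scratch/przybylo' is an assumed scratch location; substitute whatever scratch space applies
cluster = PBSCluster(queue='economy', project='UALB0017', walltime='12:00:00',
                     local_directory='/glade/scratch/przybylo/dask-worker-space')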

Thanks guys!

jhamman commented 6 years ago

@vprzybylo - I just read your most recent post but I'm still not sure what you're looking for help with. I think it would be useful for you to start with a smaller example and incrementally scale that up until things break.

One question I have: how many tasks do you have in your graph? You have 135 workers, but do you have that many tasks? Next, are you serializing any large data objects (it doesn't look like it, but that can cause slowdowns)? Finally, looking at your worker CPU usage, it seems you have a few workers using more than their share (>400% CPU). Any idea why that may be happening?
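
For the first question, a quick way to compare the two numbers (a minimal sketch, reusing the client and the output list from the earlier snippet):

n_tasks = len(output)                                   # one delayed call per element of phio
n_workers = len(client.scheduler_info()["workers"])     # workers currently connected to the scheduler
print(n_tasks, n_workers)                               # any workers beyond n_tasks will sit idle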

vprzybylo commented 6 years ago

I am 'pickling' a large object after the worker results are gathered because I need to be able to access the output data to make plots after the run completes. But based on the order of execution, I believe I should see all of the futures complete first, which isn't the case.

I was using 15 workers in version 0.3.0, and the loop I am parallelizing iterates 50 times, so 50 tasks are created. There are numerous functions within each iteration since the parallelization is on the outermost loop. Based on the change to cluster.scale in 0.4.0, I multiplied the 15 workers by 9.

The CPU usage question you pose was my largest concern, since some of the workers are well over 400% and some are near 0%. I will go back to some smaller examples to see how many CPUs are being used and scale up continuously as you suggest.

I do have a question in the meantime while I work on smaller tasks: can I add workers mid-run that come online to assist, or do I need to wait for them all to start running in the queue before running the main call?

Thanks

jhamman commented 6 years ago

Yes, you can scale up the worker pool after its creation. However, the compute method blocks until complete, so you would have to rework things in order to do what you want here. What you may want to do is use the adaptive cluster in this case: rather than calling cluster.scale(30), call cluster.adapt(minimum=30, maximum=...).
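
A minimal sketch of that pattern (the maximum value here is only an illustration, not a recommendation):

cluster = PBSCluster(queue='economy', project='UALB0017', walltime='12:00:00')
client = Client(cluster)

# let dask-jobqueue grow and shrink the PBS jobs to match the task load
cluster.adapt(minimum=30, maximum=135)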

If we're going to go back and forth a bit on this issue, I'd prefer we move the conversation over to the Pangeo issue tracker so it has more visibility. You'll likely get better feedback there too.

vprzybylo commented 6 years ago

Perfect, I was looking into cluster.adapt for this reason anyway, so thanks for confirming.

I didn't know the best place to submit my questions, sorry! Is there a way to directly move this to the Pangeo issue tracker, or should I open a new issue over there (assuming I need more help) in the future?

jhamman commented 6 years ago

I suggest opening a new issue if you need more input. I'm glad adapt seems to be what you're looking for.