rapidsai / dask-cuda

Utilities for Dask and CUDA interactions
https://docs.rapids.ai/api/dask-cuda/stable/
Apache License 2.0

Using cuda cluster on HPC: Implementation question for a PBSCudaCluster/Job using LocalCUDACluster #653

Open MordicusEtCubitus opened 3 years ago

MordicusEtCubitus commented 3 years ago

Dears,

First, I would like to thank you for the great work you have done to make using Dask with GPUs easier. LocalCUDACluster is really great and makes things much simpler.

I have been able to run it quite easily on a single computer with many GPUs, or on a local network with many computers each having one GPU.

That's great!

I also work a bit on HPC using Dask-Jobqueue with PBS or Slurm for CPU computing - and that's fine too.

Now, you may have guessed it: I would like to run the CUDA cluster on an HPC system with many nodes, each having several GPUs. I've seen that you have thought about many features along these lines in the scheduler/worker configuration, such as InfiniBand support.

Actually, I've been able to run LocalCUDACluster on a single HPC node with many GPUs by submitting a simple Python script in the usual PBS/Slurm way. So I can run a LocalCUDACluster on a single node. A first step.

But I don't clearly understand how I can start a PBS/Slurm job running the CUDA scheduler and workers across many nodes, the way PBSCluster/SlurmCluster do for CPU workers.

Can it be done, and if yes, how?

After having a look inside the PBSCluster class, it appears that it only defines a static job attribute pointing to the PBSJob class.

So, if I want to run workers using dask-cuda, it seems I would need to subclass PBSJob (and PBSCluster) so that each job starts CUDA workers instead of the default ones.

That seems nearly simple. I do not have access to the HPC right now, and would appreciate your comments before investing in this attempt.

Another question regarding the setting of distributed.cli.dask_worker in the Job class: it is hard-coded in the Job superclass and not stored in an attribute, but it can be retrieved from the job instance's _command_template attribute.

So in PBSCudaJob.__init__ I could try to replace the string 'distributed.cli.dask_worker' with 'dask_cuda.cli.dask_cuda_worker' in self._command_template, but this is not a best practice. Do you have a better suggestion? Have I understood the work to implement, or have I missed a lot of steps? (A rough sketch of the idea follows.)
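A minimal, untested sketch of this subclassing idea, assuming dask-jobqueue's PBSJob/_command_template and job_cls internals behave as described in this thread; the class names are just the ones proposed above:

from dask_jobqueue import PBSCluster
from dask_jobqueue.pbs import PBSJob


class PBSCudaJob(PBSJob):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Swap the worker entry point so each PBS job launches dask-cuda
        # workers instead of the default distributed worker.
        self._command_template = self._command_template.replace(
            "distributed.cli.dask_worker", "dask_cuda.cli.dask_cuda_worker"
        )


class PBSCudaCluster(PBSCluster):
    job_cls = PBSCudaJob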

Thanks for your help.

Gaël,

quasiben commented 3 years ago

I think it's a bit of work but definitely doable!

  1. Before getting to CUDA workers, I think we first need to resolve starting the scheduler inside the cluster. Right now the scheduler starts in the process where the cluster object is created. Perhaps this is OK if you are launching jobs from within the cluster itself (I believe this is what happens at ORNL, for example). I attempted a PR for this last year, but it has since languished. The inspiration for deploy_mode=local came from the dask-kubernetes machinery, which tries to answer similar configuration questions about where the scheduler lives (in the client process or in a pod in the k8s cluster).

  2. I think it would be great, though not strictly necessary, for dask-jobqueue to support the dask-spec CLI. We recently went through this change in dask-yarn. Again, referring to the old dask-jobqueue PR, a deploy_command was added, but I do think it's nicer if we support the dask-spec CLI. Switching to the new CLI allows users to easily choose the worker_class: dask_cuda.CUDAWorker, dask.distributed.Nanny, or many other worker options. Note we also do this in dask-cloudprovider.

@andersy005 @guillaumeeb if you have additional thoughts here or corrections

MordicusEtCubitus commented 3 years ago

Regarding the first point - starting the scheduler on a cluster node - my simple suggestion would be to launch the scheduler and workers from an MPI job, as is already done on HPC systems.

That sounds easy and not too risky, since it uses classical techniques that already work on HPC and with Dask-MPI. But here are a few pending questions I'm wondering about (I'm not really used to MPI):

  1. Should the MPI script run the scheduler and workers inside the MPI session? That seems more natural.
  2. Or should it start the scheduler/workers by running system commands? With subprocess, for example.
  3. Could we use the MPI comm world to spawn the dashboard in a separate process?
    Currently it runs in the same process as the scheduler, slowing the scheduler down under heavy load, so it needs to be disabled.
    But in a separate process, the scheduler could inform the dashboard of its actual status by sending data only when it has time to.

With option 1, the scheduler and workers could communicate in the usual way or keep using MPI primitives. Maybe that is a double advantage; maybe it will be worse by costing too much in resources, in which case option 2 is a workaround.

With option 2, the workers and the scheduler would be ordinary system processes, started outside the MPI script and no longer tied to it. What about the MPI process itself: should it return immediately?

One more question: how will the client, running in Jupyter on the login node (usually outside the HPC cluster), get back the scheduler address from the PBS job in order to connect to it? (One common workaround is sketched below.)
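A common answer, independent of this thread, is to have the scheduler write a scheduler file to a shared filesystem and have the client read it. A minimal sketch, assuming the path below is visible from both the login node and the compute nodes (the path is purely illustrative):

from dask.distributed import Client

# Hypothetical path on a shared filesystem; the job that starts the
# scheduler must write this file (e.g. via the --scheduler-file option).
SCHEDULER_FILE = "/shared/scratch/my-scheduler.json"

# The client reads the scheduler address from the JSON file instead of
# needing to know the compute node's hostname in advance.
client = Client(scheduler_file=SCHEDULER_FILE)
print(client)
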
MordicusEtCubitus commented 3 years ago

Regarding the second point : I agree.

So if we can define a good way to run the full Dask cluster, including the scheduler, inside the HPC cluster and use it from a client node, I will be pleased to give it a try and test it on every HPC system I can.

Just to express my feelings: there are a lot of ways to start Dask: on a local machine, distributed manually over a network, on the cloud, on HPC with dask-jobqueue, using MPI, ...

That's fine and this is a proof that Dask is able to run everywhere.

But it becomes difficult to understand and maintain, I guess - although I have not yet contributed a single line of it.

Personally, I will soon need a good diagram to see clearly through all the implementation options and how they could be reused from one to another. Surely something good could come out of that, so it may be my next task if I can succeed with the first one.

Anyway, thank you for taking the time to answer me. By the way: when I write "could we", just read "is it possible to".

quasiben commented 3 years ago

Dask on MPI systems is used, and used quite a bit more than expected. You might be interested in the dask-mpi project. @jacobtomlinson, have you used dask-mpi to start a dask-cuda cluster?

The questions that you posed are great, and I think the dask-mpi docs answer quite a few of them. For example, the batch jobs page outlines how the scheduler is started within the MPI job on rank 0.

andersy005 commented 3 years ago

How will the client, running in Jupyter on the login node (usually outside the HPC cluster), get back the scheduler address from the PBS job in order to connect to it?

@jacobtomlinson, is this something https://github.com/dask-contrib/dask-ctl could help with now or in the future?

andersy005 commented 3 years ago

I attempted a PR for this last year, but it has since languished. The inspiration for deploy_mode=local came from the dask-kubernetes machinery, which tries to answer similar configuration questions about where the scheduler lives (in the client process or in a pod in the k8s cluster)

@quasiben, is https://github.com/dask/dask-jobqueue/pull/390 still worth pursuing?

MordicusEtCubitus commented 3 years ago

Hi, thanks for the replies and links.

I think it should be easy to allow running LocalCUDACluster from dask-mpi, so I will try that first as a kick start, if it has not already been done.

Then I will go back to starting a PBSCluster based on CUDA (or CPU) nodes, with the scheduler running inside the HPC cluster rather than on the login node, using the dask-spec CLI. Then I will try to make this cluster reachable and launchable from the login node (if possible) so it can be used from Jupyter.

That sounds like a good workflow. One step at a time. Yes, I will try; I'm not sure I will succeed.

I will start on Thursday. If you have any recommendations, feel free...

jacobtomlinson commented 3 years ago

have you used dask-mpi to start a dask-cuda cluster?

@quasiben I haven't. It seems you can choose between Nanny and Worker via a boolean flag, but you can't provide a custom class today. It wouldn't be much effort to expand this to support CUDAWorker, though.

The workflow then would be to submit a Python script via PBS which uses dask-mpi to start the cluster.
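A minimal sketch of such a script, assuming the dask_mpi.initialize() API with its default (non-CUDA) workers, since, as noted above, dask-mpi does not currently accept a custom worker class; the interface and local_directory values are illustrative assumptions, and the script would be launched with mpirun (or the site's equivalent launcher) inside the PBS job:

from dask_mpi import initialize
from dask.distributed import Client


def main():
    # Rank 0 becomes the scheduler, rank 1 runs this client code, and the
    # remaining MPI ranks become workers.
    initialize(interface="ib0", local_directory="/tmp")  # illustrative values

    client = Client()  # connects to the scheduler started on rank 0
    print(client)

    # Trivial sanity check that work is spread across the workers.
    futures = client.map(lambda x: x ** 2, range(100))
    print(sum(client.gather(futures)))


if __name__ == "__main__":
    main()
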

is this something https://github.com/dask-contrib/dask-ctl could help with now or in the future?

@andersy005 yeah, this is something it would support in the future. The dask-ctl support needs implementing in each of the cluster managers, so work needs to be done in dask-jobqueue.

Given that dask-mpi doesn't use a cluster manager object, integrating dask-ctl there gets trickier. This is partly why I have been experimenting with a Runner class in distributed to encapsulate things like dask-mpi, but there seems to be some pushback on that approach.

is dask/dask-jobqueue#390 still worth pursuing?

Yeah I really think it is. It would be necessary for dask-ctl integration too.

guillaumeeb commented 3 years ago

Hey all, quite nice seeing this discussion.

I'm really in favor of all the improvements in this discussion.

I've still got a question though: is it really mandatory to have the scheduler running remotely with dask-jobqueue before implementing a solution to launch CUDAWorker with it (so probably implementing the dask-spec CLI, if I understood correctly)? I feel both improvements are somewhat independent (even if both might be required in some HPC centers...), and only the second is really needed to answer the original issue. But maybe I missed something.

@MordicusEtCubitus did you have time to work on this?

benjha commented 3 years ago

Hi @MordicusEtCubitus,

In our case, what we decided to do is to create a job script where the dask-scheduler and dask-cuda-workers are launched, and then launch the Python client. Note that the scheduler and workers run on compute nodes, while the Python client runs on a batch/service node. This way we feel we have more control over the mapping between workers and node resources.

Find below a sample based on the LSF scheduler, but the pattern should be the same for PBS/Slurm.

#BSUB -P <PROJECT>
#BSUB -W 0:05
#BSUB -alloc_flags "gpumps smt4 NVME"
#BSUB -nnodes 2
#BSUB -J rapids_dask_test_tcp
#BSUB -o rapids_dask_test_tcp_%J.out
#BSUB -e rapids_dask_test_tcp_%J.out

PROJ_ID=<project>

module load ums
module load ums-gen119
module load nvidia-rapids/0.18

SCHEDULER_DIR=$MEMBERWORK/$PROJ_ID/dask
WORKER_DIR=/mnt/bb/$USER

if [ ! -d "$SCHEDULER_DIR" ]
then
    mkdir $SCHEDULER_DIR
fi

SCHEDULER_FILE=$SCHEDULER_DIR/my-scheduler.json

echo 'Running scheduler'
jsrun --nrs 1 --tasks_per_rs 1 --cpu_per_rs 1 --smpiargs="-disable_gpu_hooks" \
      dask-scheduler --interface ib0 \
                     --scheduler-file $SCHEDULER_FILE \
                     --no-dashboard --no-show &

#Wait for the dask-scheduler to start
sleep 10

# Launch one dask-cuda-worker per GPU: 6 resource sets per node, each with 1 GPU and 2 CPU cores
jsrun --rs_per_host 6 --tasks_per_rs 1 --cpu_per_rs 2 --gpu_per_rs 1 --smpiargs="-disable_gpu_hooks" \
      dask-cuda-worker --nthreads 1 --memory-limit 82GB --device-memory-limit 16GB --rmm-pool-size=15GB \
                       --death-timeout 60  --interface ib0 --scheduler-file $SCHEDULER_FILE --local-directory $WORKER_DIR \
                       --no-dashboard &

#Wait for WORKERS
sleep 10

WORKERS=12   # 2 nodes x 6 dask-cuda-workers per node

python -u $CONDA_PREFIX/examples/dask-cuda/verify_dask_cuda_cluster.py $SCHEDULER_FILE $WORKERS

wait

#clean DASK files
rm -fr $SCHEDULER_DIR

echo "Done!"
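For reference, a hypothetical re-creation of what a verification script such as verify_dask_cuda_cluster.py might do (the actual script shipped with the nvidia-rapids module may differ): connect via the scheduler file, wait until the expected number of workers has registered, and run a trivial smoke test.

import sys

from dask.distributed import Client


def main():
    # Arguments match the job script above: scheduler file path and the
    # expected worker count.
    scheduler_file, expected_workers = sys.argv[1], int(sys.argv[2])

    client = Client(scheduler_file=scheduler_file)
    # Block until all dask-cuda workers have registered with the scheduler.
    client.wait_for_workers(n_workers=expected_workers)
    print(f"{len(client.scheduler_info()['workers'])}/{expected_workers} workers connected")

    # Trivial distributed computation as a smoke test.
    total = sum(client.gather(client.map(lambda x: x + 1, range(10))))
    assert total == sum(range(1, 11))
    print("Cluster verification succeeded")


if __name__ == "__main__":
    main()
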
github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.