pytorch / ignite

High-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently.
https://pytorch-ignite.ai
BSD 3-Clause "New" or "Revised" License

`idist.initialize` fails in Slurm when using `--ntasks-per-gpu` #3259

Open nowtryz opened 3 months ago

nowtryz commented 3 months ago

🐛 Bug description

When launching a Slurm step with multiple tasks and assigning GPUs with the --ntasks-per-gpu flag instead of --ntasks-per-node (which seems to be what was intended by ignite), ignite uses the SLURM_LOCALID environment variable as the local rank and then as the device id, even though --ntasks-per-gpu already binds each MPI process to a GPU. This causes the call torch.cuda.set_device(self._local_rank) to fail.

To reproduce:

srun --ntasks-per-gpu=1 --nodes=2 --gpus-per-node=4 python -c "import ignite.distributed as idist; idist.initialize(backend='nccl')"

Which produces the following output:

    idist.initialize(backend="nccl")
  File ".../python3.11/site-packages/ignite/distributed/utils.py", line 577, in initialize
    _set_model(comp_model_cls(backend, **kwargs))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../python3.11/site-packages/ignite/distributed/comp_models/native.py", line 92, in __init__
    self._create_from_backend(
  File ".../python3.11/site-packages/ignite/distributed/comp_models/native.py", line 127, in _create_from_backend
    torch.cuda.set_device(self._local_rank)
  File ".../python3.11/site-packages/torch/cuda/__init__.py", line 408, in set_device
    torch._C._cuda_setDevice(device)

Intended behaviour: Either

- Detect the presence of the --ntasks-per-gpu flag, which does not seem to be possible
- Allow to override local rank with idist.set_local_rank(), which is never considered when SLURM_JOB_ID is detected

Environment

nowtryz commented 3 months ago

Also, would it be possible not to warn when only MASTER_ADDR and MASTER_PORT are provided, as they are used and seem to be intended to be provided? https://github.com/pytorch/ignite/blob/34a707e53785cf8a524589f33a570a7516fe064e/ignite/distributed/comp_models/native.py#L607-L614
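
For context, a minimal sketch (with placeholder hostname and port values) of providing only these two variables before initialization:

    import os

    # Placeholder values; in a real job script they come from the Slurm allocation.
    os.environ.setdefault("MASTER_ADDR", "node-0.example.cluster")
    os.environ.setdefault("MASTER_PORT", "29500")

    import ignite.distributed as idist

    # Expectation behind this request: no warning when only these two variables are set.
    idist.initialize(backend="nccl")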

vfdev-5 commented 3 months ago

Thanks for reporting the issue @nowtryz! Let me see what can be done here.

Detect the presence of the --ntasks-per-gpu flag, which does not seem to be possible

Is there no env var responsible for this argument? By the way, why is it necessary to set it, and what's the typical value, 1?

Allow to override local rank with idist.set_local_rank(), which is never considered when SLURM_JOB_ID is detected

Yes, this is unfortunate. IIRC, we rely on SLURM_LOCALID as the single local rank provider... In this case, if idist.set_local_rank() were working, how would you set it? Is it possible to know which MPI process is bound to which GPU?

nowtryz commented 3 months ago

Hi,

Is there no env var responsible for this argument? By the way, why is it necessary to set it, and what's the typical value, 1?

From what I see, there is only one input environment variable and no output one. Yes, --ntasks-per-gpu would typically be used with a defined number of GPUs (--gpus=N), which spawns processes over a defined number of GPUs instead of a defined number of nodes. It also makes it possible to use spare resources from multiple nodes, whereas --nodes=N --ntasks-per-node=M --gpus-per-node=M requires the nodes to be completely free.

In this case, if idist.set_local_rank() were working, how would you set it? Is it possible to know which MPI process is bound to which GPU?

I would simply use idist.set_local_rank(0), which would use cuda:0. I don't know exactly how the binding is done, but it seems similar to having CUDA_VISIBLE_DEVICES set to the proper GPU, and cuda:0 always points to the correct unit when --ntasks-per-gpu is used.
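
As a quick sanity check, a small sketch (assuming the binding described above, i.e. one visible GPU per task) of why index 0 is always safe:

    import torch

    # With --ntasks-per-gpu, each task reportedly sees a single GPU, much like
    # having CUDA_VISIBLE_DEVICES restricted to one device, so only index 0 is
    # valid even when SLURM_LOCALID is greater than 0.
    print(torch.cuda.device_count())  # expected: 1 per task in this setup
    torch.cuda.set_device(0)          # always succeeds under this binding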

vfdev-5 commented 3 months ago

Following https://slurm.schedmd.com/sbatch.html, SLURM_NTASKS_PER_GPU is set when --ntasks-per-gpu is specified.

@nowtryz can you provide the full traceback to get the exact error message?

If I understand the problem correctly, each process sees a single GPU and not all GPUs, so torch.cuda.set_device(self._local_rank) fails for processes with local rank > 0.

As for the fix, IMO there are two things to be done here: 1) at https://github.com/pytorch/ignite/blob/34a707e53785cf8a524589f33a570a7516fe064e/ignite/distributed/comp_models/native.py#L126-L127, check the number of available GPUs with torch.cuda.device_count():

    if torch.cuda.is_available():
        # Fall back to device 0 when the Slurm local rank exceeds the number
        # of GPUs visible to this process (e.g. with --ntasks-per-gpu binding).
        lrank = self._local_rank if self._local_rank < torch.cuda.device_count() else 0
        torch.cuda.set_device(lrank)

@nowtryz would you like to help solve this and check if this fix works?

2) Allow to override local rank with idist.set_local_rank(), which is never considered when SLURM_JOB_ID is detected
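
As a hedged sketch of what point 2 would enable (assuming set_local_rank were honored under Slurm, which it currently is not), user code could look like:

    import ignite.distributed as idist

    # Sketch only: today this override is ignored once SLURM_JOB_ID is detected.
    idist.set_local_rank(0)           # the single GPU visible to this task
    idist.initialize(backend="nccl")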

nowtryz commented 1 month ago

Hi @vfdev-5,

Following https://slurm.schedmd.com/sbatch.html, SLURM_NTASKS_PER_GPU is set when --ntasks-per-gpu is specified.

SLURM_NTASKS_PER_GPU is just set by sbatch as an output variable so that it can be used as input by srun, so it may not be reliable. In my case, --ntasks-per-gpu was set on the srun command directly, inside the script submitted to sbatch.
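
To illustrate the limitation, a small sketch of such an environment-based detection and where it falls short:

    import os

    # SLURM_NTASKS_PER_GPU is only exported by sbatch as an output variable,
    # so this check misses jobs where --ntasks-per-gpu is passed to srun
    # directly inside the batch script.
    ntasks_per_gpu = os.environ.get("SLURM_NTASKS_PER_GPU")
    if ntasks_per_gpu is not None:
        print("sbatch was given --ntasks-per-gpu =", ntasks_per_gpu)
    else:
        print("not detectable from the environment; srun may still have received the flag")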

@nowtryz can you provide the full traceback to get the exact error message?

Sure, I will get the traceback ASAP

As for the fix, IMO there are two things to be done here:

  1. The first solution seems like a simple workaround; I will try it.
  2. I think this is the correct solution. Moreover, I think ignite should still allow user configuration under Slurm, as the Slurm job may not be configured the way ignite expects. Could it be possible to use the environment only for information not specified by the user? Let me explain: in my case, for example, my sbatch script sets MASTER_PORT and MASTER_ADDR, since the address of node 0 can easily be retrieved from the submitted script (which runs on node 0), but they are completely ignored by ignite. A sketch of this precedence idea follows below.
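
A hedged illustration of this precedence idea (not ignite's actual code; the helper name is hypothetical):

    import os

    # Hypothetical helper: prefer a user-provided PyTorch DDP variable and fall
    # back to the value derived from the Slurm environment otherwise.
    def resolve_var(name: str, slurm_derived_value: str) -> str:
        return os.environ.get(name, slurm_derived_value)

    master_addr = resolve_var("MASTER_ADDR", "address computed from SLURM vars")
    master_port = resolve_var("MASTER_PORT", "port chosen as a fallback")
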
vfdev-5 commented 1 month ago

It can be tricky to verify and to end up with a correct DDP configuration when we mix Slurm env vars with PyTorch DDP env vars (e.g. MASTER_PORT, MASTER_ADDR, RANK, LOCAL_RANK, WORLD_SIZE).

Could it be possible to use environment only for information not specified by the user?

what do you mean exactly here, which environment?

Here is where we translate the Slurm vars into PyTorch env vars: https://github.com/pytorch/ignite/blob/aa3e3e13c214fe6cf72e941a46f13378911c8894/ignite/distributed/comp_models/native.py#L554-L639

Maybe we could relax this part: https://github.com/pytorch/ignite/blob/aa3e3e13c214fe6cf72e941a46f13378911c8894/ignite/distributed/comp_models/native.py#L565-L571 and use MASTER_PORT, MASTER_ADDR, RANK, LOCAL_RANK, WORLD_SIZE from the env if the user has provided them. The problem then would be verifying that there is no inconsistency between the Slurm env vars and the user-provided PyTorch env vars...
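
A hedged sketch of the kind of consistency check this would require (the variable pairing below is an assumption, not ignite's code):

    import os

    # If both a Slurm variable and the corresponding PyTorch DDP variable are
    # set, they should agree before either one is trusted.
    pairs = [
        ("RANK", "SLURM_PROCID"),
        ("LOCAL_RANK", "SLURM_LOCALID"),
        ("WORLD_SIZE", "SLURM_NTASKS"),
    ]
    for ddp_var, slurm_var in pairs:
        if ddp_var in os.environ and slurm_var in os.environ:
            if os.environ[ddp_var] != os.environ[slurm_var]:
                raise RuntimeError(
                    f"{ddp_var}={os.environ[ddp_var]} conflicts with "
                    f"{slurm_var}={os.environ[slurm_var]}"
                )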

For reference, some time ago @sdesrozis wrote these notes on how to use ignite with Slurm: https://github.com/sdesrozis/why-ignite/tree/main/basics/2_slurm