**Open** · nowtryz opened this issue 3 months ago
Also, would it be possible not to warn when only `MASTER_ADDR` and `MASTER_PORT` are provided, as they are used and seem intended to be provided:
https://github.com/pytorch/ignite/blob/34a707e53785cf8a524589f33a570a7516fe064e/ignite/distributed/comp_models/native.py#L607-L614
Thanks for reporting the issue @nowtryz! Let me see what can be done here.
> Detect the presence of the `--ntasks-per-gpu` flag, which does not seem to be possible
Is there no env var responsible for this argument? By the way, why is it necessary to set it, and what's the typical value, 1?
> Allow to override local rank with `idist.set_local_rank()`, which is never considered when `SLURM_JOB_ID` is detected
Yes, this is unfortunate. IIRC, we rely on `SLURM_LOCALID` as the single local-rank provider... In this case, if `idist.set_local_rank()` were working, how would you set it? Is it possible to know which MPI process is bound to which GPU?
Hi,
> Is there no env var responsible for this argument? By the way, why is it necessary to set it, and what's the typical value, 1?
From what I see, there is only one input environment variable and no output one. Yes, `--ntasks-per-gpu` would typically be used with a defined number of GPUs (`--gpus=N`), which helps spawn processes on a defined number of GPUs instead of a defined number of nodes. It also makes it possible to use spare resources from multiple nodes, whereas `--nodes=N --ntasks-per-node=M --gpus-per-node=M` requires nodes to be completely free.
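To illustrate the allocation arithmetic described above (a rough sketch; the function names are made up, and the flag semantics follow the sbatch documentation):

```python
def world_size_per_node(nodes: int, ntasks_per_node: int) -> int:
    # --nodes=N --ntasks-per-node=M: tasks are tied to whole nodes,
    # so the requested nodes must be completely free.
    return nodes * ntasks_per_node

def world_size_per_gpu(gpus: int, ntasks_per_gpu: int) -> int:
    # --gpus=N --ntasks-per-gpu=K: tasks follow GPUs, which SLURM may
    # allocate as spare resources across partially used nodes.
    return gpus * ntasks_per_gpu

print(world_size_per_node(2, 4))  # 8 tasks on 2 dedicated nodes
print(world_size_per_gpu(8, 1))   # 8 tasks on any 8 free GPUs
```

Both invocations spawn the same number of tasks; the difference is only in how SLURM is allowed to place them.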
> In this case, if `idist.set_local_rank()` were working, how would you set it? Is it possible to know which MPI process is bound to which GPU?
I would simply use `idist.set_local_rank(0)`, which would use `cuda:0`. I don't know exactly how the binding is done, but it seems similar to having the `CUDA_VISIBLE_DEVICES` variable set to the proper GPU, and `cuda:0` always points to the correct unit when `--ntasks-per-gpu` is used.
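The renumbering behaviour can be mimicked without a GPU (a hypothetical helper, not ignite API; it only parses the variable the way the CUDA runtime interprets it):

```python
import os

def visible_devices() -> list:
    # The CUDA runtime renumbers the devices listed in
    # CUDA_VISIBLE_DEVICES, so the first listed physical GPU becomes
    # logical device 0 (i.e. cuda:0) inside the process.
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [int(d) for d in raw.split(",") if d.strip()]

# As SLURM might set it for the task bound to physical GPU 2:
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
print(visible_devices())  # [2]: cuda:0 now points at physical GPU 2
```

This is why `idist.set_local_rank(0)` would be correct for every task when each task sees exactly one GPU.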
Following https://slurm.schedmd.com/sbatch.html, `SLURM_NTASKS_PER_GPU` is set when `--ntasks-per-gpu` is specified.
@nowtryz can you provide the full traceback to get the exact error message?
If I understand the problem correctly, each process sees a single GPU rather than all GPUs, so `torch.cuda.set_device(self._local_rank)` fails for processes with local rank > 0.
As for the fix, IMO there are two things to be done here:
1) Check the number of available GPUs with `torch.cuda.device_count()` here: https://github.com/pytorch/ignite/blob/34a707e53785cf8a524589f33a570a7516fe064e/ignite/distributed/comp_models/native.py#L126-L127:

```python
if torch.cuda.is_available():
    lrank = self._local_rank if self._local_rank < torch.cuda.device_count() else 0
    torch.cuda.set_device(lrank)
```
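The clamping logic of the proposed fix can be exercised on its own without CUDA (a sketch; `clamp_local_rank` is a made-up name, not ignite API):

```python
def clamp_local_rank(local_rank: int, device_count: int) -> int:
    # Mirrors the fix above: fall back to device 0 when the local rank
    # exceeds the number of visible GPUs, as happens when
    # --ntasks-per-gpu leaves each task with a single visible device.
    return local_rank if local_rank < device_count else 0

print(clamp_local_rank(3, 4))  # 3: all GPUs visible, normal mapping
print(clamp_local_rank(3, 1))  # 0: task sees one GPU, use cuda:0
```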
@nowtryz would you like to help solve this and check whether this fix works?
2) Allow overriding the local rank with `idist.set_local_rank()`, which is currently never considered when `SLURM_JOB_ID` is detected.
Hi @vfdev-5,
> Following https://slurm.schedmd.com/sbatch.html, `SLURM_NTASKS_PER_GPU` is set when `--ntasks-per-gpu` is specified.
`SLURM_NTASKS_PER_GPU` is just set by `sbatch` as an output variable so that it can be used as input by `srun`. It may not be reliable. In my case, `--ntasks-per-gpu` was set on the `srun` command directly, inside the script submitted to `sbatch`.
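A quick way to see why the variable is unreliable as a detection mechanism (a hypothetical probe, not ignite code):

```python
import os

def ntasks_per_gpu_detected() -> bool:
    # SLURM_NTASKS_PER_GPU is only exported when --ntasks-per-gpu was
    # passed to sbatch itself; when the flag is given to srun directly,
    # the variable is absent even though the GPU binding is in effect.
    return "SLURM_NTASKS_PER_GPU" in os.environ

os.environ.pop("SLURM_NTASKS_PER_GPU", None)  # flag passed to srun only
print(ntasks_per_gpu_detected())  # False: binding active but undetected

os.environ["SLURM_NTASKS_PER_GPU"] = "1"      # flag passed to sbatch
print(ntasks_per_gpu_detected())  # True
```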
> @nowtryz can you provide the full traceback to get the exact error message?
Sure, I will get the traceback ASAP
> As for the fix, IMO there are two things to be done here:
My `sbatch` script sets `MASTER_PORT` and `MASTER_ADDR`, as the address of node 0 can easily be retrieved from the submitted script (which is running on node 0), but it is completely ignored by ignite.

It can be tricky to verify and get a correct DDP configuration when we mix slurm env vars with pytorch ddp env vars (e.g. `MASTER_PORT`, `MASTER_ADDR`, `RANK`, `LOCAL_RANK`, `WORLD_SIZE`).
> Could it be possible to use the environment only for information not specified by the user?
What do you mean exactly here, which environment?
Here is where we translate slurm vars into pth env: https://github.com/pytorch/ignite/blob/aa3e3e13c214fe6cf72e941a46f13378911c8894/ignite/distributed/comp_models/native.py#L554-L639
Maybe we could relax this part: https://github.com/pytorch/ignite/blob/aa3e3e13c214fe6cf72e941a46f13378911c8894/ignite/distributed/comp_models/native.py#L565-L571 and use `MASTER_PORT`, `MASTER_ADDR`, `RANK`, `LOCAL_RANK`, `WORLD_SIZE` from the env if the user has provided them. The problem here would be verifying that there is no inconsistency between the slurm env vars and the user-provided pth env vars...
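A sketch of what that relaxed behaviour could look like (a hypothetical helper, not the actual ignite implementation; it assumes the slurm-derived value has already been computed):

```python
import os

def resolve_ddp_var(name: str, slurm_derived: str) -> str:
    # Prefer a user-provided PyTorch DDP env var over the slurm-derived
    # value, but refuse silently inconsistent configurations.
    user = os.environ.get(name)
    if user is None:
        return slurm_derived
    if user != slurm_derived:
        raise RuntimeError(
            f"{name}={user} from the environment conflicts with "
            f"{slurm_derived} derived from SLURM variables"
        )
    return user

# User exported MASTER_ADDR; RANK comes from the SLURM translation only.
os.environ["MASTER_ADDR"] = "node0"
os.environ.pop("RANK", None)
print(resolve_ddp_var("MASTER_ADDR", "node0"))  # node0
print(resolve_ddp_var("RANK", "3"))             # 3
```

Raising on a mismatch rather than picking one side would at least surface the inconsistency the maintainer is worried about.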
For info, some time ago @sdesrozis wrote these notes on how to use ignite with slurm: https://github.com/sdesrozis/why-ignite/tree/main/basics/2_slurm
🐛 Bug description

When launching a slurm step with multiple tasks, assigning GPUs with the `--ntasks-per-gpu` flag instead of `--ntasks-per-node` (which seems to be what was intended), ignite uses the `SLURM_LOCALID` environment variable as the local rank and uses it as the device id, even though `--ntasks-per-gpu` already binds each MPI process to a GPU, which causes the call `torch.cuda.set_device(self._local_rank)` to fail.

To reproduce:
Which produces the following output:
Intended behaviour: Either

- Detect the presence of the `--ntasks-per-gpu` flag, which does not seem to be possible
- Allow to override local rank with `idist.set_local_rank()`, which is never considered when `SLURM_JOB_ID` is detected

Environment
- How you installed Ignite (`conda`, `pip`, source): pip