Does running with `srun` give `CUDA_VISIBLE_DEVICES=None`? If so, will the code only use the first GPU?
Actually Lightning enables various GPU-selection flags by setting the `devices` attribute of `Trainer`, see https://lightning.ai/docs/pytorch/stable/accelerators/gpu_basic.html. I manually set `CUDA_VISIBLE_DEVICES` mainly because I want `--gpu 0` to use the 0th GPU (which Lightning interprets as using zero GPUs) and `--gpu 0,1` to use specific GPUs (which would have to be the comma-separated string `"0, 1"` in Lightning) :( I think there should be a better way to utilize the selection flags in Lightning.
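For reference, a minimal sketch of the `devices` forms mentioned above, assuming a machine with at least two visible GPUs; this is illustrative only (not threestudio code), and the full table is on the linked docs page:

```python
# Illustrative only: the common `devices` forms for Trainer(accelerator="gpu").
import pytorch_lightning as pl

trainer_all      = pl.Trainer(accelerator="gpu", devices=-1)      # all visible GPUs
trainer_two      = pl.Trainer(accelerator="gpu", devices=2)       # any 2 GPUs (a count, not an index)
trainer_specific = pl.Trainer(accelerator="gpu", devices=[0, 1])  # exactly GPUs 0 and 1
# Per the discussion above, devices=0 reads as "zero GPUs", which is why a bare
# `--gpu 0` cannot simply be forwarded to Lightning unchanged.
```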
> Does running with `srun` give `CUDA_VISIBLE_DEVICES=None`? If so, will the code only use the first GPU?
`srun` automatically sets and exports `CUDA_VISIBLE_DEVICES`, e.g. `CUDA_VISIBLE_DEVICES=5,6`. I coded this PR assuming that we'll normally want to use all SLURM-allocated devices. But along the way I discovered https://github.com/threestudio-project/threestudio/issues/195, which I think is a bug, to be investigated and fixed later.
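For context, a minimal sketch of how a launcher could inspect what `srun` exported; this is illustrative, not necessarily the exact logic in this PR:

```python
# Illustrative sketch: check what srun exported before deciding which devices to use.
import os

visible = os.environ.get("CUDA_VISIBLE_DEVICES")  # e.g. "5,6" under srun, or None otherwise
if visible:
    # Device indices are relative to the visible set, so two visible GPUs map to [0, 1].
    devices = list(range(len(visible.split(","))))
else:
    devices = -1  # fall back to "all GPUs on the machine"
print(devices)
```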
> see https://lightning.ai/docs/pytorch/stable/accelerators/gpu_basic.html

Thanks @bennyguo, this was very helpful! This helped me simplify this PR. Please review and squash/merge. Thank you!
@voletiv can you review this carefully, and maybe try it out from the branch? @bennyguo is on vacation, and I'd like to start using this right away to make less wasteful use of the cluster. I can show you my workflow.
@claforte I have a better idea to further simplify this. We just need to parse `args.gpu` into `devices`:

- `-1` to `-1`
- `i` (an integer) to `[i]` (a list)
- `i1,i2,...` (a comma-separated list) to `[i1,i2,...]` (a list)

In this way we don't even need to mess with `CUDA_VISIBLE_DEVICES` any more. For SLURM, we just need to use the default `-1` to use all available GPUs.
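A sketch of the mapping described above; the helper name `parse_gpu_arg` and its placement are assumptions, not the code that was eventually merged:

```python
# Hypothetical helper illustrating the suggested --gpu -> devices mapping.
from typing import List, Union

def parse_gpu_arg(gpu: str) -> Union[int, List[int]]:
    """Map the --gpu string onto Lightning's `devices` argument."""
    if gpu == "-1":
        return -1                                   # all available GPUs
    return [int(i) for i in gpu.split(",") if i]    # "2" -> [2], "0,1" -> [0, 1]

assert parse_gpu_arg("-1") == -1
assert parse_gpu_arg("2") == [2]
assert parse_gpu_arg("0,1") == [0, 1]
```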
That sounds good... I thought the same at first, but put that simplified approach aside because I wasn't sure whether downstream packages like tiny-cuda-nn explicitly rely on `CUDA_VISIBLE_DEVICES`... Still, I'll try it out.
@claforte They may rely on the "current" device, but I have that handled in the code, so we don't need to worry about it :)
Using `-1` as the default has the disadvantage that it's non-trivial to determine `n_gpus` (required downstream by `config.py`, which adds a timestamp only when 1 GPU is used). I'll try to simplify the code as you suggested, but with a default of `0`, i.e. use only the first available GPU. For SLURM commands we can always recommend using `--gpu -1`.

UPDATED: Actually, changing the default doesn't completely solve that problem. I'll stick with a default of `0`, but will disable adding the timestamp when `-1` is passed.
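A sketch of the timestamp decision being discussed, using hypothetical helpers (`n_gpus_for`, `should_add_timestamp`) rather than the actual `config.py` code:

```python
# Hypothetical helpers illustrating the single-GPU timestamp logic described above.
import torch

def n_gpus_for(devices) -> int:
    """Best-effort GPU count for a parsed --gpu value."""
    if devices == -1:
        # Non-trivial in general: the visible count is only known at runtime.
        return torch.cuda.device_count()
    return len(devices)

def should_add_timestamp(devices) -> bool:
    # Only tag the run directory with a timestamp for single-GPU runs,
    # and skip it entirely when -1 (all GPUs) was requested.
    return devices != -1 and n_gpus_for(devices) == 1
```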
Ah well... this feature turned out to be much more complicated than I expected. Some downstream code (NerfAcc or tinycudann, guessing from the stack trace) requires `CUDA_VISIBLE_DEVICES` to be defined when passing a list with a single GPU (e.g. `devices=[2]`) to `Trainer()`. E.g.:

```
python launch.py --config configs/dreamfusion-if.yaml --train --gpu 2 system.prompt_processor.prompt="a zoomed out DSLR photo of a baby bunny sitting on top of a stack of pancakes"
```

crashes with a deep stack trace ending in this error:
```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:0!
```
This problem doesn't occur if I manually prepend `CUDA_VISIBLE_DEVICES=2` to that command and pass `--gpu 0`:

```
CUDA_VISIBLE_DEVICES=2 python launch.py --config configs/dreamfusion-if.yaml --train --gpu 0 system.prompt_processor.prompt="a zoomed out DSLR photo of a baby bunny sitting on top of a stack of pancakes"
```
... starts normally and uses GPU #2 as expected.
I conclude from this that we do need to set `CUDA_VISIBLE_DEVICES`.
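A sketch of that workaround (set `CUDA_VISIBLE_DEVICES` from `--gpu`, then give Lightning indices relative to the visible set), with illustrative variable names rather than the exact PR code:

```python
# Illustrative sketch mirroring the manual `CUDA_VISIBLE_DEVICES=2 ... --gpu 0` invocation.
import os

gpu_arg = "2"  # value of --gpu, e.g. "2" or "0,1"

if gpu_arg != "-1":
    # Must happen before torch/CUDA is initialized to take effect.
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu_arg
    devices = list(range(len(gpu_arg.split(","))))  # re-indexed: physical GPU 2 becomes cuda:0
else:
    devices = -1

# Later: pl.Trainer(accelerator="gpu", devices=devices, ...)
```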
@bennyguo This is the simplest I can make the logic while still working in all cases. I tried supporting `--gpu -1` when `CUDA_VISIBLE_DEVICES` isn't set, but that caused infinite loops of forks. At this point I feel I've spent excessive time on this feature. Please approve as-is since I tested it extensively. Thank you!
Sure, thanks @claforte!
The SLURM `srun` command assigns a node that has the specified number of available GPUs, then sets `CUDA_VISIBLE_DEVICES`, e.g. to `6,7` if 2 GPUs were requested. Therefore we can't rely on `--gpu 6,7` to specify which GPUs to use.

I tested that this works for simple cases and for `srun` with 1 or more GPUs specified, except for the following known issue: with `export CUDA_VISIBLE_DEVICES=6,7`, threestudio seems to use both GPUs equally.

I recommend holding back on merging this until tomorrow... I plan to do a bit more testing then.