threestudio-project / threestudio

A unified framework for 3D content generation.

Support SLURM #191

Closed · claforte closed this 1 year ago

claforte commented 1 year ago

The SLURM srun command assigns a node that has the specified number of available GPUs, then sets CUDA_VISIBLE_DEVICES accordingly, e.g. to "6,7" if 2 GPUs were requested. Therefore we can't rely on --gpu 6,7 to specify which GPUs to use.
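To make the re-indexing concrete, here is a small illustrative check (not code from the PR) of what a process sees under a 2-GPU srun allocation:

```python
import os
import torch

# Under e.g. `srun --gpus=2`, SLURM may export CUDA_VISIBLE_DEVICES="6,7".
# Inside the process those two GPUs are re-indexed as cuda:0 and cuda:1, so
# selecting devices 6 and 7 would point past what the process can see.
print(os.environ.get("CUDA_VISIBLE_DEVICES"))  # e.g. "6,7"
print(torch.cuda.device_count())               # 2, not the node's full count
```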

I tested that this works for simple cases and with srun when 1 or more GPUs are specified, except for the following known issue:

I recommend holding back on merging it until tomorrow... I plan to do a bit more testing then.

bennyguo commented 1 year ago

Does running with srun give CUDA_VISIBLE_DEVICES=None? If so, will the code only use the first GPU?

Actually, Lightning enables various GPU-selection flags via the devices attribute of Trainer; see https://lightning.ai/docs/pytorch/stable/accelerators/gpu_basic.html. I manually set CUDA_VISIBLE_DEVICES mainly because I want --gpu 0 to mean "use the 0th GPU" (which Lightning interprets as using zero GPUs) and --gpu 0,1 to select specific GPUs (which in Lightning would have to be written as --gpu "0, 1", with a comma) :( I think there should be a better way to utilize the selection flags in Lightning.
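For reference, a sketch of the Lightning semantics just described (based on the linked docs; exact behavior may vary across Lightning versions):

```python
from pytorch_lightning import Trainer

# String forms are ambiguous for single-GPU selection:
Trainer(accelerator="gpu", devices="0")    # parsed as zero devices, not GPU 0
Trainer(accelerator="gpu", devices="0,")   # trailing comma selects GPU 0
Trainer(accelerator="gpu", devices="0,1")  # selects GPUs 0 and 1
Trainer(accelerator="gpu", devices=[0])    # an explicit list selects GPU 0
Trainer(accelerator="gpu", devices=-1)     # all available GPUs
```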

claforte commented 1 year ago

> Does running with srun give CUDA_VISIBLE_DEVICES=None? If so, will the code only use the first GPU?

srun automatically sets and exports CUDA_VISIBLE_DEVICES, e.g. CUDA_VISIBLE_DEVICES=5,6. I coded this PR assuming that we'll normally want to use all SLURM-allocated devices. But along the way I discovered https://github.com/threestudio-project/threestudio/issues/195, which I think is a bug, to be investigated and fixed later.

> see https://lightning.ai/docs/pytorch/stable/accelerators/gpu_basic.html.

Thanks @bennyguo, this was very helpful and let me simplify this PR! Please review and squash/merge. Thank you!

claforte commented 1 year ago

@voletiv can you review this carefully, and maybe try it out from the branch? @bennyguo is on vacation, and I'd like to start using this right away to make less wasteful use of the cluster. I can show you my workflow.

bennyguo commented 1 year ago

@claforte I have a better idea to further simplify this. We just need to parse args.gpu into devices:
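A minimal sketch of such parsing, assuming --gpu arrives as a string like "2", "0,1", or "-1" (parse_gpu_arg is a hypothetical name, not the code from the PR):

```python
def parse_gpu_arg(gpu: str):
    """Map a --gpu string to Lightning's Trainer(devices=...) argument.

    "-1"  -> -1        (all available GPUs)
    "2"   -> [2]       (one specific GPU)
    "0,1" -> [0, 1]    (a list of specific GPUs)
    """
    if gpu == "-1":
        return -1
    return [int(i) for i in gpu.split(",")]

# e.g. Trainer(accelerator="gpu", devices=parse_gpu_arg(args.gpu))
```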

In this way we don't even need to mess with CUDA_VISIBLE_DEVICES any more. For SLURM, we just need to use the default of -1 to use all available GPUs.

claforte commented 1 year ago

That sounds good... I thought the same at first, but put that simplified approach aside because I wasn't sure whether downstream packages like tiny-cuda-nn explicitly rely on CUDA_VISIBLE_DEVICES... Still, I'll try it out.

bennyguo commented 1 year ago

@claforte They may rely on the "current" device, but I have this handled in the code, so we don't need to worry about it :)
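The handling isn't shown here; a common way to pin the "current" CUDA device per process (an assumption about the approach, not the PR's actual code) is:

```python
import torch

def pin_current_device(local_rank: int) -> None:
    # Extensions that allocate on the "current" device (e.g. tiny-cuda-nn,
    # nerfacc) then follow this setting instead of defaulting to cuda:0.
    torch.cuda.set_device(local_rank)
```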

claforte commented 1 year ago

Using -1 as the default has the disadvantage that it's non-trivial to determine n_gpus (required downstream by config.py to add a timestamp only when 1 GPU is used). I'll try to simplify the code as you suggested, but with a default of 0, i.e. use only the first available GPU. For SLURM commands we can always recommend using --gpu -1.

UPDATED: Actually, changing the default doesn't completely solve that problem. I'll stick with a default of 0, but will disable adding the timestamp when -1 is passed.
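To illustrate the constraint (a sketch; n_gpus here is a hypothetical helper, not config.py's actual code):

```python
import torch

def n_gpus(devices) -> int:
    # With an explicit device list the count is known up front; with -1 it
    # depends on whichever machine the job lands on, which is why -1 works
    # poorly as a default for the timestamp logic.
    if devices == -1:
        return torch.cuda.device_count()
    return len(devices)
```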

claforte commented 1 year ago

Ah well... this feature turned out to be much more complicated than I expected. Some downstream code (NerfAcc or tinycudann, guessing from the stack trace) requires CUDA_VISIBLE_DEVICES to be defined when passing a single-GPU list (e.g. devices=[2]) to Trainer(). For example:

```
python launch.py --config configs/dreamfusion-if.yaml --train --gpu 2 system.prompt_processor.prompt="a zoomed out DSLR photo of a baby bunny sitting on top of a stack of pancakes"
```

crashes with a deep stack trace and this error:

```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:2 and cuda:0!
```

This problem doesn't occur if I manually prepend CUDA_VISIBLE_DEVICES=2 to that command and pass --gpu 0:

```
CUDA_VISIBLE_DEVICES=2 python launch.py --config configs/dreamfusion-if.yaml --train --gpu 0 system.prompt_processor.prompt="a zoomed out DSLR photo of a baby bunny sitting on top of a stack of pancakes"
```

... starts normally and uses GPU #2 as expected.

I conclude from this that we do need to set CUDA_VISIBLE_DEVICES.
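A minimal sketch of the resulting approach (set_visible_devices is a hypothetical name; the PR's actual logic may differ): restrict CUDA_VISIBLE_DEVICES from --gpu before CUDA initializes, unless srun has already set it, then address GPUs as 0..n-1 in-process.

```python
import os

def set_visible_devices(gpu_arg: str) -> int:
    """Restrict CUDA_VISIBLE_DEVICES from a --gpu string like "2" or "0,1".

    Must run before torch initializes CUDA. Returns the number of GPUs
    the process will see (re-indexed 0..n-1 from then on).
    """
    if "CUDA_VISIBLE_DEVICES" not in os.environ:
        # Plain invocation: restrict visibility ourselves.
        os.environ["CUDA_VISIBLE_DEVICES"] = gpu_arg
        return len(gpu_arg.split(","))
    # Under srun, respect the allocation SLURM already exported.
    return len(os.environ["CUDA_VISIBLE_DEVICES"].split(","))
```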

claforte commented 1 year ago

@bennyguo This is the simplest I can make the logic while still working in all cases. I tried supporting --gpu -1 for the case where CUDA_VISIBLE_DEVICES isn't set, but that caused an infinite loop of forks. At this point I feel I've spent excessive time on this feature. Please approve as-is since I tested it extensively. Thank you!

bennyguo commented 1 year ago

Sure, thanks @claforte !