openai / guided-diffusion

MIT License
6.03k stars · 803 forks

multi-GPU training issues #72

Closed Germany321 closed 1 year ago

Germany321 commented 1 year ago

I set the number of nodes to 2, and each node has 8 GPUs. However, I find that only the first GPU is utilized. How can I solve this problem?
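For context: guided-diffusion parallelizes across GPUs with MPI (one process per GPU) rather than `torch.nn.DataParallel`, so launching the training script with plain `python` leaves all but one GPU idle. A minimal single-node sketch, assuming the stock `scripts/image_train.py` entry point and treating the data path and flag values as placeholders:

```shell
# One MPI process per GPU; the repo's dist_util.setup_dist() maps each
# rank to its own CUDA device.
mpiexec -n 8 python scripts/image_train.py \
    --data_dir /path/to/images \
    --batch_size 4
# Note: --batch_size is per process here, so the global batch is 8 * 4.
```

Multi-node runs would additionally need a hostfile or scheduler integration (e.g. `mpiexec -hostfile hosts -n 16 ...`), which depends on the cluster setup.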

Germany321 commented 1 year ago

This issue has been fixed by changing the code.

muhamadusman commented 1 year ago

I am facing the same problem. I have 8 GPUs on my node, but I get a CUDA out-of-memory error on GPU 0 only; apparently the code is not utilizing the other 7 GPUs.

CUDA is available: True. CUDA visible devices: 8.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 39.59 GiB total capacity; 36.02 GiB already allocated; 410.19 MiB free; 37.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
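The traceback itself points at two levers: tune the allocator to reduce fragmentation, or shrink the per-process memory footprint. A hedged sketch, assuming the stock training script and that `--microbatch` behaves as in the OpenAI diffusion training loop (it splits each batch into smaller forward/backward chunks while keeping the effective batch size); the paths and numbers are placeholders:

```shell
# Allocator tuning suggested by the error message; 128 is a guess to tune,
# not a fixed recipe.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

# Spread work over all 8 GPUs (one MPI process each) and cap per-step
# memory with a microbatch of 2 while keeping batch_size at 8 per process.
mpiexec -n 8 python scripts/image_train.py \
    --data_dir /path/to/images \
    --batch_size 8 \
    --microbatch 2
```

If OOM persists on GPU 0 only, it usually means all ranks landed on the same device, i.e. the launch was not actually distributed.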

rahulshenoy77 commented 1 year ago

I have the same issue as @muhamadusman. Does anyone have a fix for this problem?

rahulshenoy77 commented 1 year ago

> This issue has been fixed by changing the code.

How did you fix this issue by changing the code?

muhamadusman commented 1 year ago

> I have the same issue as @muhamadusman. Does anyone have a fix for this problem?

The error is resolved by using mpiexec, but the problem remains: training time is not reduced 8× when using 8 GPUs instead of 1. For inference, though, using mpiexec with 8 GPUs does reduce the inference time.
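One likely explanation, assuming `--batch_size` is interpreted per process (as in the OpenAI diffusion training loop): seconds per step will not drop with more GPUs, because each step simply processes more images. The speedup shows up as throughput, not step time. A sketch of the comparison, with placeholder paths and batch sizes:

```shell
# Single GPU: 1 process * batch 32 -> 32 images per step.
python scripts/image_train.py --data_dir /path/to/images --batch_size 32

# 8 GPUs: 8 processes * batch 32 -> 256 images per step at roughly the
# same wall time per step, i.e. ~8x throughput. To train on the same
# number of images as the single-GPU run, you need ~1/8 as many steps.
mpiexec -n 8 python scripts/image_train.py --data_dir /path/to/images --batch_size 32
```

So the fair comparison is images per second (or steps needed to see the same data), not seconds per step.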

muhamadusman commented 1 year ago

I ended up training the model on a single GPU.

99-WSJ commented 11 months ago

> This issue has been fixed by changing the code.

Hello, how did you modify the code to support single-machine multi-GPU training? Thanks.

HangXux commented 7 months ago

> This issue has been fixed by changing the code.

Hi, how did you change the code?

sushilkhadkaanon commented 7 months ago

@muhamadusman @HangXux @99-WSJ @rahulshenoy77 How did you guys fix the issue?