Closed: Germany321 closed this issue 1 year ago.
This issue has been fixed by changing the code.
I am facing the same problem. I have 8 GPUs on my node, but I am getting a CUDA out-of-memory error on GPU 0 only; apparently the code is not utilizing the other 7 GPUs.
CUDA is available: True. CUDA visible devices: 8.
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 39.59 GiB total capacity; 36.02 GiB already allocated; 410.19 MiB free; 37.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
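The usual cause of an OOM on GPU 0 only is that every process places its tensors on cuda:0. Below is a minimal sketch, not this repo's actual code, of pinning each local rank to its own GPU; it assumes one process per GPU launched with Open MPI (which sets OMPI_COMM_WORLD_LOCAL_RANK; torchrun sets LOCAL_RANK instead):

```python
import os

import torch

def pin_rank_to_gpu():
    # Hypothetical helper: map each local rank to one GPU so 8 ranks use
    # 8 distinct devices instead of all allocating on GPU 0, which is
    # exactly what produces an OOM on GPU 0 while GPUs 1-7 sit idle.
    # OMPI_COMM_WORLD_LOCAL_RANK is set by Open MPI (an assumption about
    # your launcher); torchrun sets LOCAL_RANK instead.
    local_rank = int(os.environ.get(
        "OMPI_COMM_WORLD_LOCAL_RANK",
        os.environ.get("LOCAL_RANK", "0"),
    ))
    torch.cuda.set_device(local_rank)
    return torch.device(f"cuda:{local_rank}")
```

Separately, the max_split_size_mb hint in the traceback (set via PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 in the environment) only mitigates allocator fragmentation; it will not help if all processes are allocating on the same device.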
I have the same issue as @muhamadusman. Does anyone have a fix for this problem?
> This issue has been fixed by changing the code.

How did you fix this issue by changing the code?
> I have the same issue as @muhamadusman. Does anyone have a fix for this problem?

The error is resolved by using mpiexec, but the problem stays the same: training time is not reduced 8x when using 8 GPUs instead of 1. For inference, though, using mpiexec with 8 GPUs does reduce the inference time.
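On the missing speedup: with data parallelism, every step still runs the same per-GPU batch, so wall-clock time per epoch only drops ~8x if each rank processes a distinct 1/8 shard of the data. If each rank iterates the full dataset, the run is effectively 8 copies of the same job. Here is a hedged sketch using plain PyTorch DistributedDataParallel with a DistributedSampler; the model, dataset, and batch size are placeholders, not this repo's training loop:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # Assumes a torchrun-style launch (RANK/WORLD_SIZE/LOCAL_RANK in the
    # environment); an mpiexec launch works too if those variables are set.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(128, 10).cuda(), device_ids=[local_rank])

    # Placeholder data; DistributedSampler gives each rank a disjoint
    # 1/world_size shard, which is what actually shortens the epoch.
    data = TensorDataset(torch.randn(8000, 128), torch.randint(0, 10, (8000,)))
    sampler = DistributedSampler(data)
    loader = DataLoader(data, batch_size=64, sampler=sampler)

    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(2):
        sampler.set_epoch(epoch)          # reshuffle the shards each epoch
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x.cuda()), y.cuda())
            loss.backward()               # DDP all-reduces gradients here
            opt.step()

if __name__ == "__main__":
    main()
```

Launched with, e.g., `torchrun --nproc_per_node=8 train.py`, each rank then sees one eighth of every epoch, which is where the wall-clock reduction comes from.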
I ended up training the model on a single GPU.
> This issue has been fixed by changing the code.

Hi, how did you modify the code to support single-node multi-GPU training? Thanks.
> This issue has been fixed by changing the code.

Hi, how did you change the code?
@muhamadusman @HangXux @99-WSJ @rahulshenoy77 How did you guys fix the issue?
I set the number of nodes to 2, each node with 8 GPUs. However, I find that only the first GPU is being utilized. How can I solve this problem?
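A quick way to narrow this down is to print, under your launcher, which device each rank actually ends up on. A small diagnostic sketch (the environment variable names are launcher-specific assumptions):

```python
import os

import torch

# Run under your launcher, e.g. `mpiexec -n 16 python check_ranks.py`
# across 2 nodes. Each node's 8 processes should report 8 distinct
# devices; if every line prints device 0, the local rank is not being
# used to pin the device, which matches "only the first GPU is used".
local_rank = int(os.environ.get(
    "LOCAL_RANK",                                        # set by torchrun
    os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"),   # set by Open MPI
))
torch.cuda.set_device(local_rank)
print(f"host={os.uname().nodename} local_rank={local_rank} "
      f"device={torch.cuda.current_device()} of {torch.cuda.device_count()} visible")
```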