udhavsethi opened this issue 1 year ago
Me too, but with ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9).
I also encountered this issue with exitcode: -9. Are there any updates on this?
Me too. My config: 4x V100 16GB, 128GB CPU RAM. How can this be solved?
Me too.
Try running on a single GPU by setting --nproc_per_node=1.
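A minimal sketch of that suggestion, assuming a hypothetical train.py entry point and ds_config.json (neither filename appears in this thread):

```bash
# Launch a single worker process on one GPU instead of one per GPU.
torchrun --nproc_per_node=1 train.py --deepspeed ds_config.json
```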
Any solution? 2x RTX 3090, same error :(
I am trying to run the finetuning script on 8x 32GB V100 GPUs. I am launching it with the torchrun command, using DeepSpeed with both parameter and optimizer offload enabled, plus a few minor modifications:
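The exact command is not reproduced here; below is a minimal sketch of this kind of launch, assuming a hypothetical train.py entry point and a ds_config.json that enables CPU offload (all names and paths are placeholders, not the author's actual modifications):

```bash
# Launch 8 workers, one per V100, with a DeepSpeed config that offloads
# parameter and optimizer state to CPU RAM.
torchrun --nproc_per_node=8 --master_port=29500 train.py \
    --model_name_or_path <model_path> \
    --data_path <data_path> \
    --output_dir <output_dir> \
    --deepspeed ds_config.json
```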
I am running into the following errors:
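The full traceback is not shown; the key line, quoted elsewhere in this thread, is ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9). An exit code of -9 means a worker was killed with SIGKILL, which on Linux is most often the kernel OOM killer reclaiming host memory — plausible here, since parameter and optimizer offload place large state in CPU RAM. A quick check, assuming a Linux host:

```bash
# Look for OOM-killer activity around the time of the crash.
dmesg | grep -iE 'killed process|out of memory'
# Watch host memory while the job initializes; offload can exhaust CPU RAM.
free -h
```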
Here is my nvcc version:
and my NCCL version:
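The version outputs themselves are omitted above; for reference, they are typically obtained as follows (a sketch, not necessarily the exact commands the author ran):

```bash
nvcc --version                                               # CUDA toolkit version
python -c "import torch; print(torch.cuda.nccl.version())"  # NCCL version bundled with PyTorch
```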
Please let me know if I can provide any other information to help identify the source of this issue. I would greatly appreciate any help or guidance on making this work.