Loading extension module utils...
Time to load utils op: 0.2017381191253662 seconds
Parameter Offload: Total persistent parameters: 414720 in 81 params
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 53038 closing
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 53039 closing
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 53041 closing
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rankroot/anaconda3/envs/faster/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/faster/bin/torchrun", line 8, in
It seems CPU memory ran out.
Before this error occurred, available CPU memory dropped sharply to about 1 GB.
As far as I know, Linux killed the process and returned (exitcode: -9) because it could not provide enough memory.
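To check this reading, here is a minimal sketch, assuming a Linux host. A negative exit code normally means the process was terminated by a signal with that number, so -9 maps to SIGKILL, which is what the kernel OOM killer sends. The helper names `describe_exit_code` and `parse_mem_available_gib` are my own, not part of torchrun or DeepSpeed, and a SIGKILL alone does not prove an OOM kill; the kernel log would be the place to confirm that.

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Translate an exit code into a readable description.

    Negative values follow the convention that -N means
    "terminated by signal N".
    """
    if exitcode < 0:
        signum = -exitcode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {exitcode}"


def parse_mem_available_gib(meminfo_text: str):
    """Return MemAvailable in GiB from /proc/meminfo-style text, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # the value is reported in kB
            return kib / (1024 ** 2)
    return None


if __name__ == "__main__":
    print(describe_exit_code(-9))
    # /proc/meminfo exists only on Linux (assumption stated above).
    with open("/proc/meminfo") as f:
        print("MemAvailable (GiB):", parse_mem_available_gib(f.read()))
```

Running the `MemAvailable` check in a loop while the model loads would show whether host memory really falls toward 1 GB just before the processes are killed.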
My machine: 32 vCPUs, 256 GB memory, 4 × A100 80GB.