tatsu-lab / stanford_alpaca

Code and documentation to train Stanford's Alpaca models, and generate the data.
https://crfm.stanford.edu/2023/03/13/alpaca.html
Apache License 2.0

Train 13B data error #183

Open A-ML-ER opened 1 year ago

A-ML-ER commented 1 year ago

Loading extension module utils...
Time to load utils op: 0.2017381191253662 seconds
Parameter Offload: Total persistent parameters: 414720 in 81 params
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 53038 closing
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 53039 closing
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 53041 closing
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rankroot/anaconda3/envs/faster/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/faster/bin/torchrun", line 8, in

It seems CPU memory ran out before this error occurred: available CPU memory dropped sharply to 1G. As far as I know, Linux killed the process, which then returned (exitcode: -9), because the system could not provide enough resources.
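
To make the reading of `exitcode: -9` concrete, here is a minimal sketch. It assumes the launcher reports exit codes the way Python's `subprocess` module does, where a negative value `-N` means the child was terminated by signal `N`. The helper name `describe_exit_code` is hypothetical and is not part of torchrun or this repository.

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Translate a subprocess-style exit code into a readable description.

    Assumption: a negative code -N means "terminated by signal N",
    as in Python's subprocess.Popen.returncode on POSIX systems.
    """
    if exitcode < 0:
        # Look up the signal name for the signal number -exitcode.
        return f"killed by {signal.Signals(-exitcode).name}"
    return f"exited with status {exitcode}"


# On Linux, signal 9 is SIGKILL, the signal the kernel's OOM killer sends.
print(describe_exit_code(-9))
```

SIGKILL alone does not prove an out-of-memory kill, since any process with sufficient privileges can send it, but together with the sharp drop in available CPU memory it fits that explanation. The kernel log (for example via `dmesg`) is the usual place to confirm it.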

My machine: 32 vCPUs, 256G memory, and 4 × A100 80GB GPUs.
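
A rough back-of-the-envelope estimate shows why host memory could be tight on this machine. It is only a sketch under stated assumptions that the thread does not confirm: each of the 4 training processes loads a full copy of the 13B parameters into CPU memory in fp32 (4 bytes per parameter) before the model is sharded across GPUs, and "256G" means 256 GiB. The function name `host_memory_estimate_gib` is hypothetical.

```python
def host_memory_estimate_gib(n_params: float, bytes_per_param: int, n_procs: int) -> float:
    """Estimate peak host RAM (GiB) if every process holds a full model copy.

    Assumption: no sharing or lazy loading between processes; optimizer
    state and activations are ignored, so real usage may be higher.
    """
    total_bytes = n_params * bytes_per_param * n_procs
    return total_bytes / 2**30


# 13B parameters, fp32 weights, one full copy per GPU process (4 processes).
estimate = host_memory_estimate_gib(n_params=13e9, bytes_per_param=4, n_procs=4)
available_gib = 256  # assumed reading of the "256G" reported above

print(f"estimated weights in host RAM: {estimate:.1f} GiB")
print(f"fraction of available memory: {estimate / available_gib:.0%}")
```

Under these assumptions the weights alone take roughly 194 GiB, about three quarters of the machine's memory, before counting any CPU-offloaded optimizer state, the dataset, or the Python runtime. That is consistent with available memory collapsing during startup, though it does not by itself identify the cause.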

wallon-ai commented 1 year ago

I ran into the same problem. Is there a solution yet?

LuciaIsFine commented 10 months ago

Same problem...