meta-llama / llama3

The official Meta Llama 3 GitHub site

Running Llama3-70B-Instruct on 8x A100-40G fails with the following error #113

Open flowbywind opened 4 months ago

flowbywind commented 4 months ago

Running Llama3-70B-Instruct on 8x A100-40G produces the following error:

[2024-04-22 10:52:15,696] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-04-22 10:52:15,696] torch.distributed.run: [WARNING] *****

initializing model parallel with size 8
initializing ddp with size 1
initializing pipeline with size 1
[2024-04-22 10:53:55,894] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7159 closing signal SIGTERM
[2024-04-22 10:53:55,966] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7160 closing signal SIGTERM
[2024-04-22 10:53:55,966] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7161 closing signal SIGTERM
[2024-04-22 10:53:55,967] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7162 closing signal SIGTERM
[2024-04-22 10:53:55,967] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7163 closing signal SIGTERM
[2024-04-22 10:53:55,967] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7164 closing signal SIGTERM
[2024-04-22 10:53:55,968] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 7165 closing signal SIGTERM
[2024-04-22 10:53:58,513] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 7158) of binary: /home/vipuser/anaconda3/envs/llm/bin/python3.10
Traceback (most recent call last):
  File "/home/vipuser/anaconda3/envs/llm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

example_text_completion.py FAILED

Failures:

-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-22_10:53:55
  host      : pc_0
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 7158)
  error_file:
  traceback : Signal 9 (SIGKILL) received by PID 7158
=====================================================
python-BaseException
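(Context on the failure: exit code -9 means rank 0 received SIGKILL. On Linux this usually comes from the kernel OOM killer rather than from PyTorch itself, for example when all 8 ranks stage their checkpoint shards in host RAM at the same time. A minimal sketch, assuming the psutil package is installed and a bf16 checkpoint, to sanity-check host memory before launching; the constants below are rough estimates, not values from the repo.)

# Rough host-RAM sanity check before launching torchrun (hypothetical helper,
# not part of the llama3 repo). Assumes the 70B checkpoint is stored in bf16.
import psutil

PARAMS = 70e9            # ~70 billion parameters
BYTES_PER_PARAM = 2      # bf16
MP_RANKS = 8             # matches --nproc_per_node 8

ckpt_bytes = PARAMS * BYTES_PER_PARAM      # ~140 GB of weights in total
per_rank_bytes = ckpt_bytes / MP_RANKS     # ~17.5 GB per model-parallel shard

avail = psutil.virtual_memory().available
print(f"available host RAM : {avail / 1e9:.1f} GB")
print(f"total checkpoint   : {ckpt_bytes / 1e9:.1f} GB")
print(f"per-rank shard     : {per_rank_bytes / 1e9:.1f} GB")

# If every rank holds its shard in CPU memory at once, peak host usage
# approaches the full ~140 GB, a common trigger for the OOM killer to send
# SIGKILL (exit code -9) on machines with less RAM than that.
if avail < ckpt_bytes:
    print("WARNING: available RAM is below the total checkpoint size; "
          "loading may be killed by the OOM killer.")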
jidandan666 commented 4 months ago

What command did you use to run it?

flowbywind commented 4 months ago

What command did you use to run it?

torchrun --nproc_per_node 8 example_text_completion.py --ckpt_dir Meta-Llama-3-70B-Instruct/ --tokenizer_path Meta-Llama-3-70B-Instruct/tokenizer.model --max_seq_len 512 --max_batch_size 8
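(The command itself looks consistent with the 70B checkpoint, which is shipped as 8 model-parallel shards, hence --nproc_per_node 8. One way to reduce peak host-RAM pressure during loading is to memory-map each rank's shard instead of reading it eagerly into RAM. Below is a generic sketch of that idea, not the repo's own loading code; it assumes PyTorch >= 2.1, where torch.load accepts mmap=True, checkpoints saved in the zip-based serialization format, and the standard consolidated.0X.pth shard naming.)

# Hedged sketch: memory-mapped loading of one model-parallel shard to lower
# peak host-RAM use. Illustration only; assumes PyTorch >= 2.1 and zip-format
# .pth files (mmap=True does not work with the legacy pickle format).
import os
from pathlib import Path
import torch

ckpt_dir = Path("Meta-Llama-3-70B-Instruct")
rank = int(os.environ.get("LOCAL_RANK", 0))   # set per process by torchrun

shard_path = sorted(ckpt_dir.glob("consolidated.*.pth"))[rank]
state_dict = torch.load(shard_path, map_location="cpu", mmap=True)

print(f"rank {rank} mapped shard {shard_path.name} with {len(state_dict)} tensors")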

YanJiaHuan commented 3 weeks ago

any update lately?