meta-llama / llama

Inference code for Llama models

torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 760) #877

Open bechellis opened 11 months ago

bechellis commented 11 months ago

Hi everybody,

I tried to deploy the Llama 2 model in a PyTorch/CUDA environment:

CUDA version: 12.1
ID of current CUDA device: 0
Name of current CUDA device: Quadro P4000

but I ran into the following issue. Does anyone have an idea of what's wrong?

torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 6

initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
[2023-10-26 11:56:24,266] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 2283) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/user/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

example_chat_completion.py FAILED

Failures:

Root Cause (first observed failure):
[0]:
  time       : 2023-10-26_11:56:22
  host       :
  rank       : 0 (local_rank: 0)
  exitcode   : -9 (pid: 2283)
  error_file : <N/A>
  traceback  : Signal 9 (SIGKILL) received by PID 2283
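For reference, exitcode -9 means the worker was killed with SIGKILL rather than failing with a Python exception; on Linux the usual sender is the kernel OOM killer when host RAM runs out. A minimal sketch for checking the kernel log (assumes a Linux host; reading dmesg may require sudo):

# Minimal sketch: look for recent OOM-killer activity in the kernel log.
# Assumes a Linux host; reading dmesg may require elevated privileges.
import subprocess

def recent_oom_kills():
    log = subprocess.run(
        ["dmesg", "--ctime"], capture_output=True, text=True, check=False
    ).stdout
    return [
        line for line in log.splitlines()
        if "out of memory" in line.lower() or "oom-kill" in line.lower()
    ]

if __name__ == "__main__":
    hits = recent_oom_kills()
    print("\n".join(hits) if hits else "No OOM-killer entries found in dmesg.")

If an entry names the torchrun child PID, the failure is a host memory problem rather than a CUDA or model-code error.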

Runtime Environment

WieMaKa commented 11 months ago

Facing the same issue here.

subramen commented 11 months ago

Please share the full stack trace, which contains the actual error.

lmelinda commented 7 months ago

When I see this issue, I actually don't see any other stack trace. The full log starts with torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9), which is the same as in this post.
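With exitcode -9 there is often no Python traceback at all, because the kernel kills the worker before it can print one. torchrun can still capture whatever each worker wrote to stdout/stderr before being killed. A hedged example (flag names as in recent torch.distributed.run versions; check torchrun --help for yours):

torchrun --nproc_per_node 1 --log_dir ./torchrun_logs --redirects 3 --tee 3 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 6

--redirects 3 sends both stdout and stderr of every worker to files under --log_dir, and --tee 3 also echoes them to the console, so anything printed before the SIGKILL is preserved.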

amew0 commented 6 months ago

Facing the same issue here!

Following the same steps; my environment is a Google Colab T4 GPU.

lmelinda commented 6 months ago

In my case, the CPU (host RAM) running out of memory seems to contribute to it.
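That matches the exit code: -9 is SIGKILL, which the Linux OOM killer sends when host memory is exhausted. Loading llama-2-7b-chat reads the whole checkpoint (roughly 13 GB of fp16 weights in consolidated.00.pth) into CPU RAM with torch.load before the tensors are moved to the GPU, so machines or Colab instances with little free RAM get killed at exactly this point. A minimal sketch for checking the headroom before launching torchrun (assumes psutil is installed and the standard single-shard llama-2-7b-chat layout; the 1.2x headroom factor is a rough guess):

# Minimal sketch: compare available host RAM to the checkpoint size
# before launching torchrun. Assumes `psutil` is installed and the
# standard llama-2-7b-chat layout with a single consolidated.00.pth shard.
import os
import psutil

CKPT = "llama-2-7b-chat/consolidated.00.pth"  # adjust to your --ckpt_dir

ckpt_gb = os.path.getsize(CKPT) / 1e9
avail_gb = psutil.virtual_memory().available / 1e9

print(f"checkpoint size : {ckpt_gb:.1f} GB")
print(f"available RAM   : {avail_gb:.1f} GB")

# Rough headroom guess: the tokenizer, CUDA context and Python itself
# also need memory on top of the raw checkpoint.
if avail_gb < ckpt_gb * 1.2:
    print("Likely to be killed by the OOM killer while loading the weights.")

If available RAM is below the checkpoint size, adding swap or moving to a machine with more memory should make the -9 go away.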