meta-llama / llama

Inference code for Llama models

torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 760) #877

Open bechellis opened 11 months ago

bechellis commented 11 months ago

Hi everybody,

I tried to deploy the Llama 2 model in a PyTorch/CUDA environment:

CUDA version: 12.1
ID of current CUDA device: 0
Name of current CUDA device: Quadro P4000

but I ran into the following issue. Does anyone have an idea of what's wrong?

torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 6

initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
[2023-10-26 11:56:24,266] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 2283) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/user/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

example_chat_completion.py FAILED

Failures:

Root Cause (first observed failure):
[0]:
  time       : 2023-10-26_11:56:22
  host       :
  rank       : 0 (local_rank: 0)
  exitcode   : -9 (pid: 2283)
  error_file : <N/A>
  traceback  : Signal 9 (SIGKILL) received by PID 2283
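For reference, exitcode -9 means the worker was killed with SIGKILL rather than failing with a Python exception; on Linux the usual sender is the kernel OOM killer when host RAM runs out. A minimal sketch for checking the kernel log (assumes a Linux host; reading dmesg may require sudo):

# Minimal sketch: look for recent OOM-killer activity in the kernel log.
# Assumes a Linux host; reading dmesg may require elevated privileges.
import subprocess

def recent_oom_kills():
    log = subprocess.run(
        ["dmesg", "--ctime"], capture_output=True, text=True, check=False
    ).stdout
    return [
        line for line in log.splitlines()
        if "out of memory" in line.lower() or "oom-kill" in line.lower()
    ]

if __name__ == "__main__":
    hits = recent_oom_kills()
    print("\n".join(hits) if hits else "No OOM-killer entries found in dmesg.")

If an entry names the torchrun child PID, the failure is a host memory problem rather than a CUDA or model-code error.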

Runtime Environment

WieMaKa commented 11 months ago

Facing the same issue here.

subramen commented 11 months ago

Please share the full stack trace, which contains the actual error.

lmelinda commented 7 months ago

When I see this issue, I actually don't see any other stack trace. The full log starts with torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9), which is the same as in this post.
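With exitcode -9 there is often no Python traceback at all, because the kernel kills the worker before it can print one. torchrun can still capture whatever each worker wrote to stdout/stderr before being killed. A hedged example (flag names as in recent torch.distributed.run versions; check torchrun --help for yours):

torchrun --nproc_per_node 1 --log_dir ./torchrun_logs --redirects 3 --tee 3 example_chat_completion.py --ckpt_dir llama-2-7b-chat/ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 6

--redirects 3 sends both stdout and stderr of every worker to files under --log_dir, and --tee 3 also echoes them to the console, so anything printed before the SIGKILL is preserved.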

amew0 commented 6 months ago

Facing the same issue here!

Following the same steps; my environment is a Google Colab T4 GPU.

lmelinda commented 6 months ago

In my case, the CPU (host RAM) running out of memory seems to contribute to it.
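That matches the exit code: -9 is SIGKILL, which the Linux OOM killer sends when host memory is exhausted. Loading llama-2-7b-chat reads the whole checkpoint (roughly 13 GB of fp16 weights in consolidated.00.pth) into CPU RAM with torch.load before the tensors are moved to the GPU, so machines or Colab instances with little free RAM get killed at exactly this point. A minimal sketch for checking the headroom before launching torchrun (assumes psutil is installed and the standard single-shard llama-2-7b-chat layout; the 1.2x headroom factor is a rough guess):

# Minimal sketch: compare available host RAM to the checkpoint size
# before launching torchrun. Assumes `psutil` is installed and the
# standard llama-2-7b-chat layout with a single consolidated.00.pth shard.
import os
import psutil

CKPT = "llama-2-7b-chat/consolidated.00.pth"  # adjust to your --ckpt_dir

ckpt_gb = os.path.getsize(CKPT) / 1e9
avail_gb = psutil.virtual_memory().available / 1e9

print(f"checkpoint size : {ckpt_gb:.1f} GB")
print(f"available RAM   : {avail_gb:.1f} GB")

# Rough headroom guess: the tokenizer, CUDA context and Python itself
# also need memory on top of the raw checkpoint.
if avail_gb < ckpt_gb * 1.2:
    print("Likely to be killed by the OOM killer while loading the weights.")

If available RAM is below the checkpoint size, adding swap or moving to a machine with more memory should make the -9 go away.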