meta-llama / llama

Inference code for Llama models

torch.distributed.elastic.multiprocessing.errors.ChildFailedError: #420

Open liquidpeachy opened 1 year ago

liquidpeachy commented 1 year ago

Running into the same error with the 13B and 70B chat models on a single H100 80GB card. The 7B chat model works fine.

Command (13b):

torchrun --nproc_per_node 2 example_chat_completion.py --ckpt_dir llama-2-13b-chat/ --tokenizer_path tokenizer.model --max_seq_len 4096 --max_batch_size 4

Error:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
> initializing model parallel with size 2
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
  File "example_chat_completion.py", line 149, in <module>
    fire.Fire(main)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "example_chat_completion.py", line 20, in main
    generator = Llama.build(
  File "/home/ubuntu/llama/llama/generation.py", line 69, in build
    torch.cuda.set_device(local_rank)
  File "/usr/lib/python3/dist-packages/torch/cuda/__init__.py", line 350, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 74007 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 74008) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/bin/torchrun", line 11, in <module>
    load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')()
  File "/usr/lib/python3/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 344, in wrapper
    return f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/lib/python3/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/lib/python3/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/lib/python3/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
example_chat_completion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-19_16:31:42
  host      : 209-20-158-162
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 74008)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
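The "invalid device ordinal" above looks like a rank/GPU mismatch rather than a problem with the weights themselves: the 13B checkpoint ships as two shards (model parallel size 2), so --nproc_per_node 2 spawns two ranks, and rank 1 calls torch.cuda.set_device(1), which fails on a machine with a single visible GPU. A minimal sanity check before launching (a sketch; the nproc value is simply copied from the command above):

# sketch: confirm the number of visible GPUs matches the number of ranks torchrun will spawn
# (assumption: --nproc_per_node 2, as in the 13B command above)
import torch

nproc_per_node = 2                   # value passed to torchrun
visible = torch.cuda.device_count()  # GPUs this process can see
print(f"visible GPUs: {visible}, requested ranks: {nproc_per_node}")

# Llama.build() calls torch.cuda.set_device(local_rank); any local_rank >= device_count()
# raises "CUDA error: invalid device ordinal".
assert nproc_per_node <= visible, "not enough visible GPUs for the requested model parallelism"

As far as I can tell, the reference code expects exactly one checkpoint shard per rank, so running 13B or 70B on a single card would require merging the shards first, which these example scripts do not do.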
vithikapungliya commented 1 year ago

I faced the same issue with 7B: `The client socket has failed to connect to [IN31GFRRL143ZWD.ap.wkglobal.com]:29500 (system error: 10049 - unknown error)`. Do you know how to solve this?

MisterZig commented 1 year ago

I'm getting this error too. 7B is the only model I've tried so far, as 70B was a little too big for me.

msmmpts commented 1 year ago

Hi,

I am also experiencing the same issue while using the 7B model in a Jupyter notebook.

Logs attached below for reference.

> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Loaded in 151.59 seconds
Traceback (most recent call last):
  File "/home/jupyter/llama2/llama/example_chat_completion.py", line 90, in <module>
    fire.Fire(main)
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/jupyter/llama2/llama/example_chat_completion.py", line 73, in main
    results = generator.chat_completion(
  File "/home/jupyter/llama2/llama/llama/generation.py", line 270, in chat_completion
    generation_tokens, generation_logprobs = self.generate(
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/jupyter/llama2/llama/llama/generation.py", line 122, in generate
    assert max_prompt_len <= params.max_seq_len
AssertionError
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2914614) of binary: /opt/conda/envs/llm/bin/python3.10
Traceback (most recent call last):
  File "/opt/conda/envs/llm/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/llm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_chat_completion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-24_06:54:13
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2914614)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Any thoughts on how to resolve this issue?
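The AssertionError in that log is the prompt-length check in llama/generation.py: the longest tokenized prompt must fit within the max_seq_len given to Llama.build. A rough up-front check (a sketch; the max_seq_len value and prompt string are placeholders for whatever you actually pass in):

# sketch: check that a prompt fits in max_seq_len before calling chat_completion
# (assumes the sentencepiece tokenizer.model shipped with the weights)
from sentencepiece import SentencePieceProcessor

sp = SentencePieceProcessor(model_file="tokenizer.model")
max_seq_len = 512                      # placeholder: the value given to Llama.build / --max_seq_len
prompt = "your dialog contents here"   # placeholder: text you send to chat_completion

n_tokens = len(sp.encode(prompt))      # chat_completion adds [INST] formatting, so the real count is a bit higher
print(f"prompt tokens: {n_tokens}, max_seq_len: {max_seq_len}")
if n_tokens > max_seq_len:
    print("either raise --max_seq_len or shorten the prompt, otherwise generate() asserts")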

pzim-devdata commented 1 year ago

I solved this error message: I don't have an AMD or Nvidia graphics card, so I installed the CPU version from https://github.com/krychu/llama instead of https://github.com/facebookresearch/llama. Complete process to install:

  1. Download the original version of Llama from https://github.com/facebookresearch/llama and extract it to a llama-main folder.
  2. Download the CPU version from https://github.com/krychu/llama, extract it, and replace the files in the llama-main folder.
  3. Run the download.sh script in a terminal, passing the URL provided when prompted, to start the download.
  4. Go to the llama-main folder.
  5. Create a Python 3 env: python3 -m venv env and activate it: source env/bin/activate
  6. Install the CPU version of PyTorch: python3 -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu # for the CPU version (a quick check of the install is sketched after this list)
  7. Install the dependencies of llama: python3 -m pip install -e .
  8. If you have downloaded llama-2-7b, run:
    torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 1 # (instead of 4)
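After step 6, it is worth confirming that the CPU-only wheel is the one that actually got installed; a minimal check (not part of the original instructions):

# sketch: verify the CPU-only PyTorch wheel is active in the env
import torch

print(torch.__version__)          # wheels from the /whl/cpu index usually carry a "+cpu" suffix
print(torch.cuda.is_available())  # expected: False on a machine without a supported GPU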
Tuxius commented 10 months ago

Same issue on Windows 11 with 32 GB RAM and an RTX 3090 (24 GB VRAM), trying to run 7B. I have already tried different versions of CUDA and PyTorch without improvement. CPU is not an option for me. Any ideas? Here is my error:

(llama2env) PS Y:\231125 LLAMA2\llama-main> torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir ..\llama-2-7b-chat\ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 6
[2023-11-27 20:17:04,777] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [TROG2020]:29500 (system error: 10049 - Die angeforderte Adresse ist in diesem Kontext ungültig.).
[W socket.cpp:663] [c10d] The client socket has failed to connect to [TROG2020]:29500 (system error: 10049 - Die angeforderte Adresse ist in diesem Kontext ungültig.).
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
  File "Y:\231125 LLAMA2\llama-main\example_chat_completion.py", line 106, in <module>
    fire.Fire(main)
  File "Y:\231125 LLAMA2\llama2env\Lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "Y:\231125 LLAMA2\llama2env\Lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "Y:\231125 LLAMA2\llama2env\Lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "Y:\231125 LLAMA2\llama-main\example_chat_completion.py", line 37, in main
    generator = Llama.build(
                ^^^^^^^^^^^^
  File "Y:\231125 LLAMA2\llama-main\llama\generation.py", line 116, in build
    tokenizer = Tokenizer(model_path=tokenizer_path)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "Y:\231125 LLAMA2\llama-main\llama\tokenizer.py", line 24, in __init__
    assert os.path.isfile(model_path), model_path
AssertionError: tokenizer.model
[2023-11-27 20:17:19,804] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 12048) of binary: Y:\231125 LLAMA2\llama2env\Scripts\python.exe
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "Y:\231125 LLAMA2\llama2env\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "Y:\231125 LLAMA2\llama2env\Lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "Y:\231125 LLAMA2\llama2env\Lib\site-packages\torch\distributed\run.py", line 806, in main
    run(args)
  File "Y:\231125 LLAMA2\llama2env\Lib\site-packages\torch\distributed\run.py", line 797, in run
    elastic_launch(
  File "Y:\231125 LLAMA2\llama2env\Lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "Y:\231125 LLAMA2\llama2env\Lib\site-packages\torch\distributed\launcher\api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_chat_completion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-27_20:17:19
  host      : XXX
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 12048)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
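The AssertionError: tokenizer.model in that log is the path check in llama/tokenizer.py rather than a CUDA or memory problem: --tokenizer_path tokenizer.model is resolved relative to the directory torchrun is launched from, and the file is not found there. A quick check before launching (a sketch; the paths are guesses based on the --ckpt_dir ..\llama-2-7b-chat\ in the command above):

# sketch: confirm both paths exist relative to the launch directory
# (paths below are assumptions based on the command shown in the log)
import os

ckpt_dir = r"..\llama-2-7b-chat"
tokenizer_path = r"..\llama-2-7b-chat\tokenizer.model"   # try the full path instead of the bare file name

print("ckpt_dir exists:", os.path.isdir(ckpt_dir))
print("tokenizer.model exists:", os.path.isfile(tokenizer_path))

If the second check fails, point --tokenizer_path at the full path of tokenizer.model (or copy the file into the llama-main folder).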