meta-llama / llama

Inference code for Llama models

torch.distributed.elastic.multiprocessing.errors.ChildFailedError: #482

Closed MDFARHYN closed 1 year ago

MDFARHYN commented 1 year ago

I downloaded the llama-2-7b model and ran the command as they mentioned:

torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 4 

but got this error:

NOTE: Redirects are currently not supported in Windows or MacOs.
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [farhan]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [farhan]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [farhan]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [farhan]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "E:\llama-main\llama-main\example_text_completion.py", line 55, in <module>
    fire.Fire(main)
  File "C:\Users\tusar\AppData\Local\Programs\Python\Python310\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "C:\Users\tusar\AppData\Local\Programs\Python\Python310\lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "C:\Users\tusar\AppData\Local\Programs\Python\Python310\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "E:\llama-main\llama-main\example_text_completion.py", line 18, in main
    generator = Llama.build(
  File "E:\llama-main\llama-main\llama\generation.py", line 62, in build
    torch.distributed.init_process_group("nccl")
  File "C:\Users\tusar\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
  File "C:\Users\tusar\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\distributed_c10d.py", line 1013, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 13040) of binary: C:\Users\tusar\AppData\Local\Programs\Python\Python310\python.exe
Traceback (most recent call last):
  File "C:\Users\tusar\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\tusar\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\tusar\AppData\Local\Programs\Python\Python310\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "C:\Users\tusar\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "C:\Users\tusar\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\run.py", line 794, in main
    run(args)
  File "C:\Users\tusar\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "C:\Users\tusar\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Users\tusar\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_text_completion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-21_21:19:19
  host      : farhan.www.tendawifi.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 13040)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
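The failing call in the trace above is torch.distributed.init_process_group("nccl") in llama/generation.py: Windows builds of PyTorch ship without NCCL, so the process group cannot be created. Below is a minimal sketch of the kind of local backend fallback some people patch in; this is an assumption on my part, not an official fix, and the rest of the example still expects a CUDA device, so it only gets past the init step:

    # Hypothetical local edit around the init in llama/generation.py (Llama.build):
    # fall back to the "gloo" backend when the installed PyTorch has no NCCL support.
    import torch.distributed as dist

    if not dist.is_initialized():
        backend = "nccl" if dist.is_nccl_available() else "gloo"
        dist.init_process_group(backend)
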
RafoolVinci commented 1 year ago

Same error me also.

rakshith111 commented 1 year ago

I got the same error when running in WSL Ubuntu.

    $ uname -a
    Linux DESKTOP-40049K6 5.15.90.1-microsoft-standard-WSL2 #1 SMP Fri Jan 27 02:56:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

liudengfeng commented 1 year ago

I got the same error when running in WSL Ubuntu.

    $ uname -a
    Linux D2 5.15.90.1-microsoft-standard-WSL2 #1 SMP Fri Jan 27 02:56:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

lonestarx1 commented 1 year ago

Same error here

eguar11011 commented 1 year ago

Same error here on Colab

pzim-devdata commented 1 year ago

I have solved it with a CPU installation by installing https://github.com/krychu/llama instead of https://github.com/facebookresearch/llama. Complete process to install:

  1. download the original version of Llama from https://github.com/facebookresearch/llama and extract it to a llama-main folder
  2. download the CPU version from https://github.com/krychu/llama, extract it, and replace the files in the llama-main folder
  3. run the download.sh script in a terminal, passing the URL provided when prompted to start the download
  4. go to the llama-main folder
  5. create a Python3 env: python3 -m venv env and activate it: source env/bin/activate
  6. install the CPU version of PyTorch (see the sanity check after this list): python3 -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu # for the CPU version
  7. install the dependencies of llama: python3 -m pip install -e .
  8. if you have downloaded llama-2-7b, run:
    torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 1 #(instead of 4)
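A quick sanity check that the CPU-only wheel from step 6 is the one actually in use (a minimal sketch, run inside the env activated in step 5):

    import torch

    # CPU-only wheels from download.pytorch.org/whl/cpu typically report a
    # version string ending in "+cpu"
    print(torch.__version__)

    # Expected to print False for the CPU-only install
    print(torch.cuda.is_available())
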
rakshith111 commented 1 year ago

Nice!!! But is there no way to use it on GPU? My best guess is there might be a problem with the latest version of torchvision.

pzim-devdata commented 1 year ago

> Nice!!! But is there no way to use it on GPU? My best guess is there might be a problem with the latest version of torchvision.

I'm not a PyTorch expert, so I don't know what the problem was; we'll have to wait and see how Facebook reacts. I used PyTorch about 10 years ago when it was a small library. Today I understand almost nothing of what I do with it, lol.

ch3njust1n commented 1 year ago

I'm getting the same issue on Apple M1 Max

MDFARHYN commented 1 year ago

pzim-devdata Thanks a lot, it's working. I have a few questions: 1) It's taking too much time to generate a response; how can I reduce the time? My PC configuration is 16 GB RAM and a 12th-gen Core i5 processor. 2) What is the difference between llama-2-7b, llama-2-7b-chat, llama-2-13b and llama-2-13b-chat? 3) What is max_batch_size? What is temperature? What is a token?

pzim-devdata commented 1 year ago

Yes, it's very slow. This solution is just for trying Llama; you will need to run it on your GPU once the bug is fixed. CUDA only works with Nvidia video cards; if you have an AMD or Intel video card you have to install PyTorch with ROCm, but I don't know if Llama works with ROCm. The difference between llama-2-7b and llama-2-7b-chat is that llama-2-7b just completes the sentence in the prompt, while the chat version is a question/answer model with unlimited prompts. 7B works with 1 GPU, 13B needs at least 2 GPUs, and 70B needs at least 8. With your configuration, the best solution is to go to a website for playing with Llama: https://chat.lmsys.org/

MDFARHYN commented 1 year ago

Thanks

Straafe commented 1 year ago

Still getting this error as well. WSL2 with a 3090 (not interested in running CPU only, interested in it running on the 3090)

jiesutd commented 1 year ago

Same error on RedHat, with a single V100 GPU and > 300 GB RAM. Any solution?
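ChildFailedError itself only wraps whatever the worker raised, so it helps to check which distributed backends the installed PyTorch actually provides (a minimal diagnostic sketch, assuming nothing beyond a working PyTorch install):

    import torch
    import torch.distributed as dist

    # Which collective backends does this PyTorch build actually ship with?
    print(torch.cuda.is_available())   # needs a CUDA build plus a visible Nvidia GPU
    print(dist.is_nccl_available())    # False on Windows and CPU-only builds
    print(dist.is_gloo_available())    # gloo is the usual CPU/Windows fallback
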

RahulSChand commented 1 year ago

If anyone is still facing this issue, do one of the following:

Add @record above the main function and it will give you a proper traceback:

from torch.distributed.elastic.multiprocessing.errors import record

@record
def main(...):
    ...

or

go to /var/log/kern.log and check the message on the last line; it will show you if it's because of insufficient VRAM.
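For the stock example script, the decorator goes directly above the existing entry point; here is a sketch assuming the parameter list of example_text_completion.py:

    import fire
    from torch.distributed.elastic.multiprocessing.errors import record

    @record
    def main(
        ckpt_dir: str,
        tokenizer_path: str,
        temperature: float = 0.6,
        top_p: float = 0.9,
        max_seq_len: int = 128,
        max_gen_len: int = 64,
        max_batch_size: int = 4,
    ):
        ...  # unchanged body of the example script

    if __name__ == "__main__":
        fire.Fire(main)

With @record in place, the real traceback from the failing worker is recorded in an error file and surfaced by torchrun instead of the bare ChildFailedError.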

suwhoanlim commented 1 year ago

> Add @record over the main function & it will give you a proper traceback […]

TL;DR: try changing the batch size from 4 to a number greater than 4. Changing 4 to 6 worked for me.

I was getting the same error message, torch.distributed.elastic.multiprocessing.errors.ChildFailedError:. When I tried Rahul's method of adding @record, it turned out I was hitting an assertion error due to the batch size.

    File "/home/soma1/docs/mine/llama/llama/generation.py", line 117, in generate
      assert bsz <= params.max_batch_size, (bsz, params.max_batch_size)
  AssertionError: (6, 4)

So I tried the following, and it worked without any problem!

torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 6 #(instead of 4)
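For context, the asserted bsz is just the number of prompts the example script passes to generate(), so trimming the prompts list is an equivalent fix (a sketch; the prompt shown is just the first example prompt from the stock script and may differ in your copy):

    # In example_text_completion.py: generate() asserts len(prompts) <= max_batch_size,
    # so either raise --max_batch_size or send fewer prompts per call.
    prompts = [
        "I believe the meaning of life is",
        # ...keep at most max_batch_size entries here
    ]
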
Tuxius commented 11 months ago

Same error; changing max_batch_size to any other number did not help. Using Windows 11 with 32 GB RAM and an RTX 3090 with 24 GB VRAM. Trying different versions of CUDA and PyTorch also did not help. Any other ideas? Here is my error:

(llama2env) PS Y:\231125 LLAMA2\llama-main> torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir ..\llama-2-7b-chat\ --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 4
[2023-11-27 20:35:09,370] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [TROG2020]:29500 (system error: 10049 - Die angeforderte Adresse ist in diesem Kontext ungültig.).
[W socket.cpp:663] [c10d] The client socket has failed to connect to [TROG2020]:29500 (system error: 10049 - Die angeforderte Adresse ist in diesem Kontext ungültig.).
initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
Traceback (most recent call last):
  File "Y:\231125 LLAMA2\llama-main\example_chat_completion.py", line 106, in <module>
    fire.Fire(main)
  File "Y:\231125 LLAMA2\llama2env\Lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "Y:\231125 LLAMA2\llama2env\Lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "Y:\231125 LLAMA2\llama2env\Lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "Y:\231125 LLAMA2\llama-main\example_chat_completion.py", line 37, in main
    generator = Llama.build(
  File "Y:\231125 LLAMA2\llama-main\llama\generation.py", line 116, in build
    tokenizer = Tokenizer(model_path=tokenizer_path)
  File "Y:\231125 LLAMA2\llama-main\llama\tokenizer.py", line 24, in __init__
    assert os.path.isfile(model_path), model_path
AssertionError: tokenizer.model
[2023-11-27 20:35:19,398] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 20040) of binary: Y:\231125 LLAMA2\llama2env\Scripts\python.exe
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "Y:\231125 LLAMA2\llama2env\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "Y:\231125 LLAMA2\llama2env\Lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "Y:\231125 LLAMA2\llama2env\Lib\site-packages\torch\distributed\run.py", line 806, in main
    run(args)
  File "Y:\231125 LLAMA2\llama2env\Lib\site-packages\torch\distributed\run.py", line 797, in run
    elastic_launch(
  File "Y:\231125 LLAMA2\llama2env\Lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "Y:\231125 LLAMA2\llama2env\Lib\site-packages\torch\distributed\launcher\api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_chat_completion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-27_20:35:19
  host      : XXX
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 20040)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Straafe commented 11 months ago

I've been running Llama and other models through ooba and haven't been using this anymore; ooba works fine.

sunyuhan19981208 commented 9 months ago

> go to /var/log/kern.log & check the message on the last line […]

Nice answer, I met this error exactly because of CPU OOM.

TailinZhou commented 8 months ago

I found that when I deleted all the '\' line continuations and ran the command as a single line, i.e. 'torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir llama-2-7b/ --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4', the error was gone.