meta-llama / llama

Inference code for Llama models

Issue with Redirects Not Supported Error in Windows and macOS. When running torchrun, a RuntimeError is encountered with the message "unmatched '}' in format string." #347

Open ericoder960803 opened 1 year ago

ericoder960803 commented 1 year ago

Description: When running the command below, a RuntimeError is encountered with the message "unmatched '}' in format string."

torchrun --nproc_per_node 1 example.py --ckpt_dir ./models/7B --tokenizer_path ./models/tokenizer.model

I encountered this issue while running a script that involves redirecting output. It seems that redirects are currently not supported in Windows environments. This causes a runtime error with the following traceback:

NOTE: Redirects are currently not supported in Windows or macOS.
Traceback (most recent call last):
  File "C:\Users\[username]\anaconda3\Scripts\torchrun-script.py", line 34, in <module>
    sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\run.py", line 794, in main
    run(args)
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\launcher\api.py", line 241, in launch_agent
    result = agent.run()
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 723, in run
    result = self._invoke_run(role)
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 858, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 692, in _initialize_workers
    self._rendezvous(worker_group)
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\metrics\api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\agent\server\api.py", line 546, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "C:\Users\[username]\anaconda3\lib\site-packages\torch\distributed\elastic\rendezvous\static_tcp_rendezvous.py", line 55, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
RuntimeError: unmatched '}' in format string

Environment: Windows 10, Python 3.9.13, torch 2.0.1

Please let me know if any further information is required to address this issue.

Delagardi commented 1 year ago

The same problem occurs on macOS.

Environment: macOS 11.5.2, Python 3.11.3, torch 2.0.1

sfcheng commented 1 year ago

Same issue on Windows 11.

torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir llama-2-7b/ --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4

NOTE: Redirects are currently not supported in Windows or MacOs.
[W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [alienware]:29500 (system error: 10049 - The requested address is not valid in its context.). (this warning is printed four times)
Traceback (most recent call last):
  File "I:\projects\llama\example_text_completion.py", line 55, in <module>
    fire.Fire(main)
  File "i:\apps\miniconda3\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "i:\apps\miniconda3\lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "i:\apps\miniconda3\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "I:\projects\llama\example_text_completion.py", line 18, in main
    generator = Llama.build(
  File "I:\projects\llama\llama\generation.py", line 62, in build
    torch.distributed.init_process_group("nccl")
  File "i:\apps\miniconda3\lib\site-packages\torch\distributed\distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
  File "i:\apps\miniconda3\lib\site-packages\torch\distributed\distributed_c10d.py", line 1013, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 96416) of binary: i:\apps\miniconda3\python.exe
Traceback (most recent call last):
  File "i:\apps\miniconda3\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "i:\apps\miniconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "i:\apps\miniconda3\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "i:\apps\miniconda3\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "i:\apps\miniconda3\lib\site-packages\torch\distributed\run.py", line 794, in main
    run(args)
  File "i:\apps\miniconda3\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "i:\apps\miniconda3\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "i:\apps\miniconda3\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

example_text_completion.py FAILED

HelixNGC7293 commented 1 year ago

I figured this out on Windows: in \llama\generation.py (line 62, in build), change torch.distributed.init_process_group("nccl") to torch.distributed.init_process_group("gloo").

It actually works for me.
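
For reference, a minimal sketch of that change as it would sit in llama/generation.py (the exact line number varies between versions, and the is_initialized guard is an addition for safety, not part of the stock code; gloo ships with PyTorch builds on Windows and macOS, while NCCL is Linux/CUDA-only):

    import torch.distributed as dist

    # Inside Llama.build(): NCCL is only included in Linux CUDA builds of
    # PyTorch, so fall back to the gloo backend on Windows/macOS.
    # Original line: torch.distributed.init_process_group("nccl")
    if not dist.is_initialized():
        dist.init_process_group("gloo")  # rendezvous env vars are set by torchrun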

MaximilianDueppe commented 1 year ago

I figured this out on Windows: in \llama\generation.py (line 62, in build), change torch.distributed.init_process_group("nccl") to torch.distributed.init_process_group("gloo").

I tried this, but nothing changed for me. Still trying to resolve this issue.

MirunaClinciu commented 1 year ago

Same issue with model llama-2-7b-chat. What I tried (I will update this; a combined sketch of attempts 1 and 2 follows the list):

  1. adding torch.distributed.init_process_group("gloo") => doesn't work
  2. setting os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo" => doesn't work
  3. trying different --max_batch_size values (1, 3, 6, etc.) => doesn't work
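
A stand-alone combination of attempts 1 and 2 looks like this (the env-var defaults are only there so the snippet runs outside torchrun; note that PL_TORCH_DISTRIBUTED_BACKEND is read by PyTorch Lightning, not by plain torch.distributed, so on its own it has no effect here):

    import os
    import torch.distributed as dist

    # Attempt 2: only honoured by PyTorch Lightning, not plain torch.
    os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"

    # torchrun normally sets these; the defaults make the snippet runnable
    # as a single local process.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")

    # Attempt 1: initialize the process group with the gloo backend.
    dist.init_process_group("gloo")
    print(dist.get_backend())  # -> gloo
    dist.destroy_process_group()
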
ajithkumar666 commented 1 year ago

I figured this out on Windows: in \llama\generation.py (line 62, in build), change torch.distributed.init_process_group("nccl") to torch.distributed.init_process_group("gloo").

It actually works for me.

It did not work on Windows for me either.

ghost commented 1 year ago

Same here. Any solution?

mateury commented 1 year ago

Same error here while I'm trying to run example_chat_completion.py. System: macOS 14.0 (M1).

File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1268, in _new_process_group_helper raise RuntimeError("Distributed package doesn't have NCCL built in") RuntimeError: Distributed package doesn't have NCCL built in [2023-10-08 20:52:17,432] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 70815) of binary: /opt/homebrew/opt/python@3.11/bin/python3.11

ghost commented 1 year ago

I got past this problem by installing gloo and going to generation.py and changing nccl to gloo: find it and just replace it.

I have been trying to run it for 2 days, but I think 8 GB of VRAM and 32 GB of RAM are not enough to run it. I suggest you use one of TheBloke's quantized models from Hugging Face and run it with llama.cpp on your Mac. You can even run it using your Mac's GPU.

But if you want to train on your own data, you'll have to learn how to fine-tune first, and then convert the result to a quantized model runnable with llama.cpp.

Nehe12 commented 1 year ago

The model does initialize for me, but this error appears:

[2023-11-08 12:23:44,484] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [CC]:29500 (system error: 10049 - La dirección solicitada no es válida en este contexto.).
[W socket.cpp:663] [c10d] The client socket has failed to connect to [CC]:29500 (system error: 10049 - La dirección solicitada no es válida en este contexto.).

initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
Traceback (most recent call last):
  File "C:\Users\CC\Documents\INTELIGENCIA_ARTIFICIAL\prueba\llama\example_chat_completion.py", line 104, in <module>
    fire.Fire(main)
  File "C:\Python311\Lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "C:\Python311\Lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "C:\Python311\Lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "C:\Users\CC\Documents\INTELIGENCIA_ARTIFICIAL\prueba\llama\example_chat_completion.py", line 35, in main
    generator = Llama.build(
  File "C:\Users\CC\Documents\INTELIGENCIA_ARTIFICIAL\prueba\llama\llama\generation.py", line 92, in build
    torch.cuda.set_device(local_rank)
  File "C:\Python311\Lib\site-packages\torch\cuda\__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
[2023-11-08 12:23:49,537] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 22516) of binary: C:\Python311\python.exe
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Python311\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "C:\Python311\Lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "C:\Python311\Lib\site-packages\torch\distributed\run.py", line 806, in main
    run(args)
  File "C:\Python311\Lib\site-packages\torch\distributed\run.py", line 797, in run
    elastic_launch(
  File "C:\Python311\Lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Python311\Lib\site-packages\torch\distributed\launcher\api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

example_chat_completion.py FAILED

Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-08_12:23:49
  host      : CC
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 22516)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
baokexu commented 12 months ago

I figured this out on Windows: in \llama\generation.py (line 62, in build), change torch.distributed.init_process_group("nccl") to torch.distributed.init_process_group("gloo").

It actually works for me.

Looks like this resolved my issue, even though another issue comes up... but it is a different one now.

bravelyi commented 9 months ago

Has anyone resolved this issue?

naget commented 8 months ago

Has anyone resolved this issue?

shailenderjain commented 8 months ago

I am able to run it in a Debian environment. Is there any solution for this issue on Windows? In generation.py I tried changing torch.distributed.init_process_group("nccl") to torch.distributed.init_process_group("gloo"), but it does not work.

alperinugur commented 7 months ago

I figured this out on Windows: in \llama\generation.py (line 62, in build), change torch.distributed.init_process_group("nccl") to torch.distributed.init_process_group("gloo").

It actually works for me.

Worked fine for me.

Robert-Jia00129 commented 5 months ago

I figured this out on Windows: in \llama\generation.py (line 62, in build), change torch.distributed.init_process_group("nccl") to torch.distributed.init_process_group("gloo").

It actually works for me.

Adding on to this: if you are on a Mac and don't have CUDA support, comment out the line torch.cuda.set_device(local_rank) in ./llama/generation.py.
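
A guarded version of that spot, as a sketch (the stock generation.py calls torch.cuda.set_device(local_rank) unconditionally; this skips it when no CUDA device is present instead of commenting it out by hand):

    import os
    import torch

    # torchrun exports LOCAL_RANK; fall back to 0 when run stand-alone.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))

    # Only select a CUDA device when CUDA is actually available
    # (e.g. skip on Apple Silicon Macs and CPU-only Windows installs).
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)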

montyc123 commented 5 months ago

./llama/generation.py