meta-llama / llama3

The official Meta Llama 3 GitHub site

Redirects are currently not supported in Windows or MacOs #196

Open xingchaoet opened 3 weeks ago

xingchaoet commented 3 weeks ago

torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir Meta-Llama-3-8B-Instruct --tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model --max_seq_len 512 --max_batch_size 6
[2024-05-08 08:37:17,241] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [iprotect.cloudcore.cn]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W socket.cpp:663] [c10d] The client socket has failed to connect to [iprotect.cloudcore.cn]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "E:\new_space\github\ai\llama3\example_chat_completion.py", line 84, in <module>
    fire.Fire(main)
  File "D:\tools\Python3106\lib\site-packages\fire\core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "D:\tools\Python3106\lib\site-packages\fire\core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "D:\tools\Python3106\lib\site-packages\fire\core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "E:\new_space\github\ai\llama3\example_chat_completion.py", line 31, in main
    generator = Llama.build(
  File "E:\new_space\github\ai\llama3\llama\generation.py", line 68, in build
    torch.distributed.init_process_group("nccl")
  File "D:\tools\Python3106\lib\site-packages\torch\distributed\c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "D:\tools\Python3106\lib\site-packages\torch\distributed\distributed_c10d.py", line 1148, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "D:\tools\Python3106\lib\site-packages\torch\distributed\distributed_c10d.py", line 1268, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
[2024-05-08 08:37:22,314] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 22332) of binary: D:\tools\Python3106\python.exe
Traceback (most recent call last):
  File "D:\tools\Python3106\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "D:\tools\Python3106\lib\runpy.py", line 86, in _run_code

fbnav commented 3 weeks ago

Hi, it looks like you are running this on Windows. NCCL isn't supported on Windows. Can you please check here, use torch.distributed.init_process_group(backend='gloo'), and see if that works?
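For reference, a minimal sketch of that change in llama/generation.py (the init_process_group call shown in the traceback above); the torch.cuda.is_available() guard is only a suggestion, not something the repo currently does:

import torch
import torch.distributed

# In Llama.build() in llama/generation.py, where the process group is created:
if not torch.distributed.is_initialized():
    # NCCL is only available with Linux GPU builds of PyTorch; fall back to gloo
    # on Windows or on CPU-only installs.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    torch.distributed.init_process_group(backend)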

jhyangkorea commented 3 weeks ago

I have the same problem. I tried replacing torch.distributed.init_process_group("nccl") with torch.distributed.init_process_group(backend='gloo'), but it raises the same error.

Endote commented 1 week ago

Hey, I have the same issue running on Windows. After replacing nccl with gloo I get the following:

(env) C:\Users\1\Desktop\projects\LLM\llama3>torchrun --nproc_per_node 1 example_chat_completion.py --ckpt_dir Meta-Llama-3-8B-Instruct/ --tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model  --max_seq_len 512 --max_batch_size 6
W0519 16:14:50.630995 18300 torch\distributed\elastic\multiprocessing\redirects.py:27] NOTE: Redirects are currently not supported in Windows or MacOs.
[W socket.cpp:697] [c10d] The client socket has failed to connect to [view-localhost]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W socket.cpp:697] [c10d] The client socket has failed to connect to [view-localhost]:29500 (system error: 10049 - The requested address is not valid in its context.).
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
[rank0]: Traceback (most recent call last):
[rank0]:   File "C:\Users\1\Desktop\projects\LLM\llama3\example_chat_completion.py", line 84, in <module>
[rank0]:     fire.Fire(main)
[rank0]:   File "C:\Users\1\Desktop\projects\LLM\llama3\env\lib\site-packages\fire\core.py", line 143, in Fire
[rank0]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]:   File "C:\Users\1\Desktop\projects\LLM\llama3\env\lib\site-packages\fire\core.py", line 477, in _Fire
[rank0]:     component, remaining_args = _CallAndUpdateTrace(
[rank0]:   File "C:\Users\1\Desktop\projects\LLM\llama3\env\lib\site-packages\fire\core.py", line 693, in _CallAndUpdateTrace
[rank0]:     component = fn(*varargs, **kwargs)
[rank0]:   File "C:\Users\1\Desktop\projects\LLM\llama3\example_chat_completion.py", line 31, in main
[rank0]:     generator = Llama.build(
[rank0]:   File "C:\Users\1\Desktop\projects\LLM\llama3\llama\generation.py", line 75, in build
[rank0]:     torch.cuda.set_device(local_rank)
[rank0]:   File "C:\Users\1\Desktop\projects\LLM\llama3\env\lib\site-packages\torch\cuda\__init__.py", line 399, in set_device
[rank0]:     torch._C._cuda_setDevice(device)
[rank0]: AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
E0519 16:14:55.696470 18300 torch\distributed\elastic\multiprocessing\api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 11588) of binary: C:\Users\1\Desktop\projects\LLM\llama3\env\Scripts\python.exe
Traceback (most recent call last):
  File "C:\Users\1\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\1\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\1\Desktop\projects\LLM\llama3\env\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "C:\Users\1\Desktop\projects\LLM\llama3\env\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "C:\Users\1\Desktop\projects\LLM\llama3\env\lib\site-packages\torch\distributed\run.py", line 879, in main
    run(args)
  File "C:\Users\1\Desktop\projects\LLM\llama3\env\lib\site-packages\torch\distributed\run.py", line 870, in run
    elastic_launch(
  File "C:\Users\1\Desktop\projects\LLM\llama3\env\lib\site-packages\torch\distributed\launcher\api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Users\1\Desktop\projects\LLM\llama3\env\lib\site-packages\torch\distributed\launcher\api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_chat_completion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-19_16:14:55
  host      : DESKTOP-0LPT7H4
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 11588)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Is there a solution for running the model on Windows yet?

fbnav commented 1 week ago

Hi, this might be because the version of PyTorch you are using is not compatible with your CUDA version. Could you try pip install torch --upgrade to move to a newer version of PyTorch that supports the _cuda_setDevice attribute? Alternatively, reinstall PyTorch and run it again; you can get the install command for your setup here: https://pytorch.org/get-started/locally/
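As a quick sanity check (a minimal sketch, assuming a standard pip install of PyTorch), you can confirm whether your build actually has CUDA support before re-running torchrun:

import torch

# A CPU-only wheel reports torch.version.cuda as None and is_available() as False,
# which would also explain the missing torch._C._cuda_setDevice attribute.
print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version the wheel was built for, or None
print(torch.cuda.is_available())  # True only if a CUDA build and a usable GPU are present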

Also linking a similar issue, for reference, in which replacing the backend with gloo worked on Windows: https://github.com/meta-llama/llama3/issues/127

For this repo specifically, the example scripts are for running inference on single-GPU (for 8B) and multi-GPU (for 70B) setups using CUDA, and Windows is not currently supported. Feel free to check out the examples in the llama-recipes repo for running Llama locally via Hugging Face or Ollama: https://github.com/meta-llama/llama-recipes/tree/main/recipes/quickstart/Running_Llama3_Anywhere
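If it helps, here is a minimal sketch of running the instruct model through Hugging Face transformers instead of this repo's torchrun scripts (it assumes you have accepted the license for meta-llama/Meta-Llama-3-8B-Instruct on the Hub and installed transformers plus accelerate; the prompt and generation settings are placeholders):

import torch
from transformers import pipeline

# Downloads the model from the Hugging Face Hub; uses the GPU if one is available,
# otherwise falls back to CPU (slow, and needs a lot of RAM).
pipe = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",
)

messages = [{"role": "user", "content": "Hello, what can you do?"}]
print(pipe(messages, max_new_tokens=128)[0]["generated_text"])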