meta-llama / llama

Inference code for Llama models

Can't use Windows as fire package requires NCCL #699

Open marcusobrien opened 1 year ago

marcusobrien commented 1 year ago

Hi,

How can I use this system on Windows, when it can only be run with NCCL?

The instructions require a lot of changes for this; the example script cannot be run without switching the backend from NCCL to gloo.

RuntimeError: Distributed package doesn't have NCCL built in ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 23152) of binary: U:\Miniconda3\envs\llama2env\python.exe

Then you can't use torchrun.exe as the example shows, as it fails/complains because it is unable to start a process; you have to use python -m torchrun-script instead, since that will actually run.
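For reference, torchrun is just the console-script entry point for PyTorch's torch.distributed.run module, so an equivalent invocation that avoids the .exe wrapper is (a sketch using the README's flags):

python -m torch.distributed.run --nproc_per_node 1 example_text_completion.py --ckpt_dir llama-2-7b/ --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4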

Has anyone tried these instructions on Windows 10 Pro, using CUDA and an NVIDIA GPU?

erjenkins29 commented 1 year ago

I'm also stuck at this point, attempting to test locally. I installed all the dependencies under the same conditions as above (CUDA, Windows, a single NVIDIA GPU). From what I can tell in the code, NCCL is required, and from what I can tell online, NCCL is not supported on Windows. So it would be a good pull request to add a note to the README that Windows is not supported.

Similar problem as #673; I have the same environment as what they mention, only I am on CUDA 11.7 instead of 11.8.

slalla commented 1 year ago

I am also having this issue. It seems that #697 is also related.

robertofuentesr commented 1 year ago

Hi, I am no expert, but I read a little of this: https://pytorch.org/docs/stable/distributed.html (in summary: NCCL does not work on Windows, so it is better to use gloo). I added this at the beginning of the code and it works on Windows 11. Fix:

import os
import torch

# NCCL is not available on Windows, so force the gloo backend instead
os.environ['PL_TORCH_DISTRIBUTED_BACKEND'] = 'gloo'
os.environ['NCCL_DEBUG'] = 'INFO'
torch.distributed.init_process_group(backend="gloo")
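A slightly more defensive variant of the same idea (a sketch, not from the repo: setdefault keeps any backend already configured in the environment, and the guard avoids calling init_process_group twice if the library also initializes it):

import os
import torch.distributed as dist

# Prefer gloo unless a backend is already configured in the environment
os.environ.setdefault('PL_TORCH_DISTRIBUTED_BACKEND', 'gloo')

# Skip initialization if torchrun / the library has already done it
if not dist.is_initialized():
    dist.init_process_group(backend="gloo")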

this is my output:

running this: torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir "llama-2-7b/" --tokenizer_path "tokenizer.model" --max_seq_len 128 --max_batch_size 4

C:\Users\Rober\Desktop\llama 2\llama>torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir "llama-2-7b/" --tokenizer_path "tokenizer.model" --max_seq_len 128 --max_batch_size 4
NOTE: Redirects are currently not supported in Windows or MacOs.
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [Rober]:29500 (system error: 10049 - unknown error). (this warning repeats four times)

initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
Loaded in 35.25 seconds

I believe the meaning of life is to be happy. I believe we are all born with the potential to be happy. The meaning of life is to be happy, but the way to get there is not always easy. The meaning of life is to be happy. It is not always easy to be happy, but it is possible. I believe that

==================================

Simply put, the theory of relativity states that

1) time, space, and mass are relative, and 2) the speed of light is constant, regardless of the observer’s velocity. The theory of relativity was first proposed by Albert Einstein in 1905. The theory was later modified and expanded by Einstein in 19

==================================

A brief message congratulating the team on the launch:

    Hi everyone,

    I just

wanted to say a big congratulations to the team on the launch of the new website.

    I think it looks fantastic and I'm sure it'll be a huge success.

    Please let me know if you need anything else from me.

    Best,

==================================

Translate English to French:

    sea otter => loutre de mer
    peppermint => menthe poivrée
    plush girafe => girafe peluche
    cheese =>

fromage

    maple syrup => sirop d'érable
    honey => miel
    pineapple => ananas
    apple => pomme
    pine tree => pin
    oak tree => chêne
    birch => bouleau
    cherry

==================================

If someone wants to add something more, I would be happy; I do not completely understand the mistake.

AlmogAmiga commented 1 year ago

Thanks @robertofuentesr, it solved my problem! :D I was able to run the 7B model on Windows 10.

Another question I couldn't find an answer to: what are the system requirements to run the 7B model? When I ran it on my PC it took about 20 GB of RAM, from what I could see in Task Manager. My GPU has 8 GB of VRAM and also managed to run it with no issues.
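As a rough sanity check (back-of-the-envelope only; it ignores activations, the KV cache, and loader overhead), the weights alone account for most of that footprint:

# ~7 billion parameters stored as fp16, i.e. 2 bytes each
params = 7e9
bytes_per_param = 2
print(f"{params * bytes_per_param / 2**30:.1f} GiB")  # ~13 GiB for the weights alone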

Grunthos commented 8 months ago

When I select 'gloo' as the backend, it does not use the GPU -- it seems to default to the CPU. Is there a solution to this?
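One thing worth checking (an assumption about the usual cause, not a confirmed diagnosis): gloo only carries the inter-process communication, while compute happens wherever the tensors live, so a CPU-only PyTorch wheel will never touch the GPU no matter which backend is selected. A minimal check:

import torch

# False here means the installed wheel has no CUDA support at all
print(torch.cuda.is_available())

# Compute runs wherever the tensors live, independent of the backend
device = "cuda" if torch.cuda.is_available() else "cpu"
layer = torch.nn.Linear(16, 16).to(device)
print(next(layer.parameters()).device)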

snowymo commented 6 months ago

> Hi, I am no expert, but I read a little of this: https://pytorch.org/docs/stable/distributed.html (in summary: NCCL does not work on Windows, so it is better to use gloo). I added this at the beginning of the code and it works on Windows 11. [...]

I tried the same code at the top of the chat example.

Below is the error output.

NOTE: Redirects are currently not supported in Windows or MacOs.
[W C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.). (this warning repeats four times)
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
  File "example_text_completion.py", line 69, in <module>
    fire.Fire(main)
  File "D:\pycharm\envs\sam\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "D:\pycharm\envs\sam\lib\site-packages\fire\core.py", line 480, in _Fire
    target=component.__name__)
  File "D:\pycharm\envs\sam\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "example_text_completion.py", line 36, in main
    max_batch_size=max_batch_size,
  File "D:\projects\google\llama\llama\generation.py", line 120, in build
    params = json.loads(f.read())
  File "D:\pycharm\envs\sam\lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "D:\pycharm\envs\sam\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "D:\pycharm\envs\sam\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 33072) of binary: D:\pycharm\envs\sam\python.exe
Traceback (most recent call last):
  File "D:\pycharm\envs\sam\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "D:\pycharm\envs\sam\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\pycharm\envs\sam\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "D:\pycharm\envs\sam\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "D:\pycharm\envs\sam\lib\site-packages\torch\distributed\run.py", line 762, in main
    run(args)
  File "D:\pycharm\envs\sam\lib\site-packages\torch\distributed\run.py", line 756, in run
    )(*cmd_args)
  File "D:\pycharm\envs\sam\lib\site-packages\torch\distributed\launcher\api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "D:\pycharm\envs\sam\lib\site-packages\torch\distributed\launcher\api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

snowymo commented 6 months ago

It seems I failed to download a complete params.json previously.
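For anyone hitting the same JSONDecodeError, a quick way to confirm the checkpoint download is intact (a sketch; it assumes the llama-2-7b directory layout from the README):

import json
from pathlib import Path

ckpt_dir = Path("llama-2-7b")

# A zero-byte or truncated params.json is the classic sign of a broken download
print((ckpt_dir / "params.json").stat().st_size, "bytes")
json.loads((ckpt_dir / "params.json").read_text())  # raises JSONDecodeError if truncated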