meta-llama / llama3

The official Meta Llama 3 GitHub site

UserWarning: Attempted to get default timeout for nccl backend, but NCCL support is not compiled #200

Closed: stromyu520 closed this issue 1 week ago

stromyu520 commented 3 weeks ago

W0509 01:09:39.797000 8201419456 torch/distributed/elastic/multiprocessing/redirects.py:27] NOTE: Redirects are currently not supported in Windows or MacOs.

UserWarning: Attempted to get default timeout for nccl backend, but NCCL support is not compiled

warnings.warn("Attempted to get default timeout for nccl backend, but NCCL support is not compiled")
Traceback (most recent call last):

fbnav commented 3 weeks ago

Hi, could you provide more information on what platform/OS you are trying to run it on? Also, please try reinstalling PyTorch and running it again. You can do that from here: https://pytorch.org/get-started/locally/

tungts1101 commented 3 weeks ago

I got the same error. My OS is Windows 11. Here is the output of pip show torch:

Name: torch
Version: 2.3.0+cu118
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: d:\anaconda3\envs\system\lib\site-packages
Requires: filelock, fsspec, jinja2, mkl, networkx, sympy, typing-extensions
Required-by: fairscale, llama3, torchaudio, torchvision

tungts1101 commented 3 weeks ago

It seems that Windows doesn't support the NCCL backend. Does that mean I can only run Llama 3 on a Linux-based machine?
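One way to confirm this is to ask PyTorch directly which distributed backends the installed build supports. A minimal check (Windows wheels typically report NCCL as unavailable, while Gloo is usually compiled in):

```python
import torch.distributed as dist

# Query which collective-communication backends this PyTorch build was
# compiled with. NCCL is Linux/GPU-only; Gloo works on CPU across platforms.
print("NCCL available:", dist.is_nccl_available())
print("Gloo available:", dist.is_gloo_available())
print("MPI available: ", dist.is_mpi_available())
```

If NCCL shows as unavailable, any code path that requests the nccl backend will fall back to the warning seen above.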

tungts1101 commented 3 weeks ago

I tried again with Ubuntu 22.04 installed under WSL. The NCCL error has disappeared, but I still get this error when trying to run the example:

E0510 15:01:32.269000 139785519843136 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -9) local_rank: 0 (pid: 401) of binary: /home/tran/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/tran/anaconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/tran/anaconda3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/tran/anaconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/tran/anaconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/tran/anaconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/tran/anaconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
fbnav commented 3 weeks ago

Could you please provide the complete error message and your hardware specs, along with the code you tried to run?

NCCL isn't supported on Windows. If you are running on Windows, could you please check here, use torch.distributed.init_process_group(backend='gloo'), and see if that works?
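For reference, a minimal single-process sketch of initializing the process group with Gloo instead of NCCL (the MASTER_ADDR/MASTER_PORT values here are illustrative defaults, normally set by torchrun):

```python
import os
import torch.distributed as dist

# torchrun normally sets these; provide defaults for a standalone run.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Gloo is a CPU-capable backend compiled into Windows builds of PyTorch,
# so this avoids the "NCCL support is not compiled" warning.
dist.init_process_group(backend="gloo", rank=0, world_size=1)
print("active backend:", dist.get_backend())

dist.destroy_process_group()
```

Note that Gloo runs collectives on CPU, so it sidesteps the NCCL error but does not give NCCL's GPU-to-GPU performance.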

tungts1101 commented 2 weeks ago

> Could you please provide the complete error message and your hardware specs, along with the code you tried to run?
>
> NCCL isn't supported on Windows. If you are running on Windows, can you please check here and use torch.distributed.init_process_group(backend='gloo') and try if that works?

Above is the complete error message I get when trying to run the example example_chat_completion.py from the README. The OS is Ubuntu 22.04 with an Intel Core i5 and an RTX 3050 Laptop GPU.

tungts1101 commented 2 weeks ago

I think the root cause is that the hardware doesn't meet the minimum requirements to run the llama-7B model.

fbnav commented 2 weeks ago

Yes, it might be that. You will need a minimum of ~16 GB of VRAM to run the 8B model in fp16 precision. (Note that exitcode -9 in your traceback usually means the process was killed by the OS, typically for running out of memory.)
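The ~16 GB figure comes from a back-of-the-envelope estimate for the weights alone (activations, KV cache, and framework overhead come on top); a quick sketch of the arithmetic:

```python
# Rough VRAM needed just to hold the model weights:
# parameter count x bytes per parameter, converted to GiB.
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    return n_params * bytes_per_param / 1024**3

params_8b = 8e9  # ~8 billion parameters

print(f"fp16: {weight_memory_gb(params_8b, 2):.1f} GB")  # ~14.9 GB
print(f"fp32: {weight_memory_gb(params_8b, 4):.1f} GB")  # ~29.8 GB
```

An RTX 3050 Laptop GPU has only 4 GB of VRAM, well below what the fp16 weights alone require.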

fbnav commented 1 week ago

Closing this issue. Feel free to re-open if the issue persists.