Closed: stromyu520 closed this issue 1 week ago.
Hi, could you provide more information on which platform/OS you are running this on? Also, please try reinstalling PyTorch and running it again. You can get the right install command here: https://pytorch.org/get-started/locally/
I got the same error. My OS is Windows 11. Here is the output of pip show torch:
Name: torch
Version: 2.3.0+cu118
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: d:\anaconda3\envs\system\lib\site-packages
Requires: filelock, fsspec, jinja2, mkl, networkx, sympy, typing-extensions
Required-by: fairscale, llama3, torchaudio, torchvision
It seems like Windows doesn't support the NCCL backend. Does that mean I can only run llama3 on a Linux-based machine?
I have tried again with Ubuntu 22.04 installed under WSL. The NCCL error has disappeared, but I still get this error when trying to run the example:
E0510 15:01:32.269000 139785519843136 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -9) local_rank: 0 (pid: 401) of binary: /home/tran/anaconda3/bin/python
Traceback (most recent call last):
File "/home/tran/anaconda3/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/tran/anaconda3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/home/tran/anaconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/home/tran/anaconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/tran/anaconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/tran/anaconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
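Editor's note on the log above: a negative exitcode from torchrun means the child process was terminated by a signal, and -9 maps to SIGKILL, which on Linux is most often sent by the kernel's out-of-memory (OOM) killer. A quick stdlib check of that mapping:

```python
import signal

# torchrun reports a child killed by a signal as exitcode -<signum>.
# Here, exitcode -9 corresponds to signal 9 (SIGKILL), which on Linux
# is typically delivered by the kernel's OOM killer.
exitcode = -9
sig = signal.Signals(-exitcode)
print(sig.name)  # SIGKILL
```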
Could you please provide the complete error message and your hardware specs, along with the code you tried to run?
NCCL isn't supported on Windows. If you are running on Windows, please try switching the backend to Gloo with torch.distributed.init_process_group(backend='gloo') and see if that works.
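A minimal sketch of that suggestion (the helper name pick_backend is my own, not from the thread): use NCCL only on Linux, where PyTorch ships with it compiled in, and fall back to Gloo everywhere else.

```python
import sys

def pick_backend(platform: str = sys.platform) -> str:
    """Choose a torch.distributed backend for the current OS.

    NCCL is only built into Linux distributions of PyTorch; Windows and
    macOS builds ship without it, so fall back to Gloo there.
    """
    return "nccl" if platform.startswith("linux") else "gloo"

# In the actual training script you would then do (requires torch):
#   import torch.distributed as dist
#   dist.init_process_group(backend=pick_backend())

print(pick_backend("win32"))  # gloo
print(pick_backend("linux"))  # nccl
```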
Above is the complete error message from running the example example_chat_completion.py from the README. The OS is Ubuntu 22.04 with an Intel Core i5 and an RTX 3050 Laptop GPU.
I think the root cause is that the hardware doesn't meet the minimum requirements to run the Llama 3 8B model.
Yes, it might be. An exitcode of -9 means the process was killed, typically by the Linux OOM killer when it runs out of memory. You will need a minimum of ~16 GB of VRAM to run the 8B model in fp16 precision.
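As a back-of-the-envelope check (my sketch, not from the thread): 8 billion parameters at 2 bytes each in fp16 already account for roughly 15 GB of weights alone, before activations and the KV cache.

```python
# Rough fp16 memory estimate for an 8B-parameter model (weights only).
params = 8e9          # 8 billion parameters
bytes_per_param = 2   # fp16 = 2 bytes per parameter
weights_gb = params * bytes_per_param / 1024**3
print(f"{weights_gb:.1f} GB")  # ~14.9 GB, plus activations and KV cache on top
```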
Closing this issue. Feel free to re-open if the issue persists.
W0509 01:09:39.797000 8201419456 torch/distributed/elastic/multiprocessing/redirects.py:27] NOTE: Redirects are currently not supported in Windows or MacOs.
UserWarning: Attempted to get default timeout for nccl backend, but NCCL support is not compiled
warnings.warn("Attempted to get default timeout for nccl backend, but NCCL support is not compiled")
Traceback (most recent call last):