Open ericoder960803 opened 1 year ago
The same problem occurs on macOS.
Operating System: macOS 11.5.2, Python Version: 3.11.3, Torch Version: 2.0.1
Same issue on Windows 11.
example_text_completion.py FAILED
I figured this out on Windows: in llama\generation.py (line 62, in build), change torch.distributed.init_process_group("nccl") to torch.distributed.init_process_group("gloo").
It actually works for me.
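A minimal sketch of this workaround, selecting the backend automatically rather than hard-coding "gloo" (the `init_distributed` helper name and the standalone env-var defaults are illustrative assumptions; under torchrun, MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are already set):

```python
import os
from typing import Optional

import torch.distributed as dist


def pick_backend() -> str:
    # NCCL is only compiled into Linux CUDA builds of PyTorch; Windows and
    # macOS builds raise "Distributed package doesn't have NCCL built in",
    # so fall back to the CPU-friendly gloo backend there.
    return "nccl" if dist.is_nccl_available() else "gloo"


def init_distributed(backend: Optional[str] = None) -> str:
    backend = backend or pick_backend()
    # torchrun normally sets these; the defaults only let the snippet
    # run standalone as a single process.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    dist.init_process_group(backend)
    return backend
```

With this in place the hard-coded `init_process_group("nccl")` call in `Llama.build` could be replaced by `init_distributed()`, so the same script runs on Linux/CUDA and on Windows or macOS.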
> I figured this out on Windows: in llama\generation.py (line 62, in build), change torch.distributed.init_process_group("nccl") to torch.distributed.init_process_group("gloo").
I tried this, but nothing changed for me. Still trying to resolve this issue.
Same issue with model llama-2-7b-chat. What I tried: (I will update this.)
> I figured this out on Windows: in llama\generation.py (line 62, in build), change torch.distributed.init_process_group("nccl") to torch.distributed.init_process_group("gloo"). It actually works for me.
It didn't work on Windows for me either.
Same here. Any solution?
Same error here while I'm trying to run example_chat_completion.py. System: macOS 14.0 (M1)
File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1268, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
[2023-10-08 20:52:17,432] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 70815) of binary: /opt/homebrew/opt/python@3.11/bin/python3.11
I got past this problem by going to generation.py and changing "nccl" to "gloo". Find it and just replace it.
I have been trying to run it for 2 days, but I think 8 GB of VRAM and 32 GB of RAM is not enough. I suggest you use the quantized models from TheBloke on Hugging Face and run them with llama.cpp on your Mac. You can even run them on your Mac's GPU.
But if you want to train on your own data, first learn how to train, then convert the result to a quantized model runnable with llama.cpp.
The model does initialize for me, but this error appears:
[2023-11-08 12:23:44,484] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [CC]:29500 (system error: 10049 - The requested address is not valid in this context.).
[W socket.cpp:663] [c10d] The client socket has failed to connect to [CC]:29500 (system error: 10049 - The requested address is not valid in this context.).
initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
Traceback (most recent call last):
  File "C:\Users\CC\Documents\INTELIGENCIA_ARTIFICIAL\prueba\llama\example_chat_completion.py", line 104, in <module>
    fire.Fire(main)
  File "C:\Python311\Lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "C:\Python311\Lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "C:\Python311\Lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "C:\Users\CC\Documents\INTELIGENCIA_ARTIFICIAL\prueba\llama\example_chat_completion.py", line 35, in main
    generator = Llama.build(
  File "C:\Users\CC\Documents\INTELIGENCIA_ARTIFICIAL\prueba\llama\llama\generation.py", line 92, in build
    torch.cuda.set_device(local_rank)
  File "C:\Python311\Lib\site-packages\torch\cuda\__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
[2023-11-08 12:23:49,537] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 22516) of binary: C:\Python311\python.exe
Traceback (most recent call last):
  ...
  File "C:\Python311\Lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "C:\Python311\Lib\site-packages\torch\distributed\run.py", line 806, in main
    run(args)
  File "C:\Python311\Lib\site-packages\torch\distributed\run.py", line 797, in run
    elastic_launch(
  File "C:\Python311\Lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "C:\Python311\Lib\site-packages\torch\distributed\launcher\api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
example_chat_completion.py FAILED
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-11-08_12:23:49
  host      : CC
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 22516)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
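Two separate things seem to be going on in this report. The [W socket.cpp] warnings suggest the c10d rendezvous client is trying to reach the machine's host name (CC), which Windows refuses to bind (error 10049), and the AttributeError means this PyTorch build has no CUDA support at all, so torch.cuda.set_device fails. A hedged sketch of one common single-node workaround for the socket warnings (an assumption, not a confirmed fix for this exact setup) is to force the rendezvous onto loopback before the process group is created:

```python
import os

# Hypothetical single-node workaround for the 10049 socket warnings:
# point the c10d rendezvous at the loopback address instead of letting
# it resolve the machine's host name.
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "29500"
```

Equivalently, torchrun accepts `--master_addr=127.0.0.1` on the command line. The AttributeError itself is addressed by the nccl-to-gloo change plus skipping the torch.cuda.set_device call, as discussed elsewhere in this thread.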
> I figured this out on Windows: in llama\generation.py (line 62, in build), change torch.distributed.init_process_group("nccl") to torch.distributed.init_process_group("gloo"). It actually works for me.
Looks like this resolved my issue, even though another issue comes up... but it is a different one now.
Has anyone resolved this issue?
I am able to run it in a Debian environment. Is there any solution for this issue on Windows? In generation.py I tried changing torch.distributed.init_process_group("nccl") to torch.distributed.init_process_group("gloo"), but it does not work.
> I figured this out on Windows: in llama\generation.py (line 62, in build), change torch.distributed.init_process_group("nccl") to torch.distributed.init_process_group("gloo"). It actually works for me.
Worked fine for me
> I figured this out on Windows: in llama\generation.py (line 62, in build), change torch.distributed.init_process_group("nccl") to torch.distributed.init_process_group("gloo"). It actually works for me.
Adding on to this: if you are on a Mac and don't have CUDA support, comment out the line
torch.cuda.set_device(local_rank)
in ./llama/generation.py
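Rather than deleting the line, a sketch of an alternative (assuming local_rank comes from the LOCAL_RANK env var that torchrun sets) is to guard it, so the same script runs unchanged on CUDA and CPU-only machines:

```python
import os

import torch

# torchrun exports LOCAL_RANK for each worker; default to 0 when the
# script is run directly without a launcher.
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# Guard instead of deleting: CPU-only builds (Windows without CUDA,
# Apple Silicon Macs) have no CUDA device to select, and calling
# torch.cuda.set_device there raises the '_cuda_setDevice' AttributeError.
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)
```

This keeps the original behavior on Linux/CUDA while silently skipping the device selection where it cannot work.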
Description: When running the command, a RuntimeError is encountered with the message "unmatched '}' in format string." Run command:
I encountered an issue while running a script that involves redirecting output. It seems that redirects are currently not supported in Windows environments. This causes a runtime error with the following traceback:
Environment: Operating System: Windows 10, Python Version: 3.9.13, Torch Version: 2.0.1
Please let me know if any further information is required to address this issue.