meta-llama / llama

Inference code for Llama models
Other
56.08k stars · 9.53k forks

Error 10049 #720

Open TheAnomalous opened 1 year ago

TheAnomalous commented 1 year ago

Can someone please help me understand what I'm doing wrong here?

```
(llama_env) C:\Users\afull>torchrun --nproc_per_node 1 example_completion.py \
NOTE: Redirects are currently not supported in Windows or MacOs.
[W C:\b\abs_abjetg6_iu\croot\pytorch_1686932924616\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [Adam]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W C:\b\abs_abjetg6_iu\croot\pytorch_1686932924616\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [Adam]:29500 (system error: 10049 - The requested address is not valid in its context.).
D:\anaconda3\envs\llama_env\python.exe: can't open file 'example_completion.py': [Errno 2] No such file or directory
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 39992) of binary: D:\anaconda3\envs\llama_env\python.exe
Traceback (most recent call last):
  File "D:\anaconda3\envs\llama_env\Scripts\torchrun-script.py", line 10, in <module>
    sys.exit(main())
  File "D:\anaconda3\envs\llama_env\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "D:\anaconda3\envs\llama_env\lib\site-packages\torch\distributed\run.py", line 794, in main
    run(args)
  File "D:\anaconda3\envs\llama_env\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "D:\anaconda3\envs\llama_env\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "D:\anaconda3\envs\llama_env\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

example_completion.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2023-08-26_03:16:11
  host       : Adam
  rank       : 0 (local_rank: 0)
  exitcode   : 2 (pid: 39992)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
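A side note on the repeated 10049 warnings in that log: torchrun defaults the rendezvous address to the machine hostname (here `Adam`), and on some Windows setups that name does not resolve to an address the client socket can actually use, which surfaces as WinSock error 10049. A minimal stdlib sketch of the resolution check (purely diagnostic, not part of the llama code):

```python
import socket

# torchrun's default rendezvous address is the local hostname; if it does
# not resolve, or resolves to an unusable address, c10d logs error 10049.
hostname = socket.gethostname()
try:
    resolved = socket.gethostbyname(hostname)
except socket.gaierror:
    resolved = None
print(f"hostname {hostname!r} resolves to {resolved!r}")

# Loopback is always usable for single-node runs.
print("localhost resolves to", socket.gethostbyname("localhost"))
```

For single-node runs, a commonly reported workaround is passing `--master_addr=127.0.0.1` to torchrun (or setting the `MASTER_ADDR` environment variable). Note the warning is not what kills this run, though: the fatal error is the missing script.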
EmanuelaBoros commented 1 year ago

The error is telling you that the example_completion.py file does not exist. Check what you are trying to run.
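A quick way to confirm this before re-running torchrun is to check from the repo directory which example scripts actually exist. A minimal sketch using only the standard library (the filename is the one from the failing command above):

```python
from pathlib import Path

# The launcher reported: can't open file 'example_completion.py' [Errno 2].
# Check whether that script exists in the current working directory.
script = Path("example_completion.py")
print(script.name, "exists:", script.exists())

# List the example entry points this checkout actually contains.
for candidate in sorted(Path(".").glob("example_*.py")):
    print("found:", candidate.name)
```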

TheAnomalous commented 1 year ago

Thanks, I think I got it sorted... mostly.

Just running into this now:

```
(base) C:\Users\afull\llama>torchrun --nproc_per_node 1 example_completion.py --ckpt_dir CodeLlama-7b/ --tokenizer_path CodeLlama/tokenizer.model --max_seq_len 128 --max_batch_size 4
failed to create process.
```

[screenshot attached]

EmanuelaBoros commented 1 year ago

@TheAnomalous Well, your command is trying to run example_completion.py, but as your screenshot shows, there is no file with that name. You need to choose between example_chat_completion.py or example_text_completion.py.
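For reference, swapping in one of the scripts that does exist, the invocation would look like the sketch below (the checkpoint and tokenizer paths are the ones from the failing command and may need adjusting; this cannot run without the downloaded model weights):

```shell
torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir CodeLlama-7b/ --tokenizer_path CodeLlama/tokenizer.model --max_seq_len 128 --max_batch_size 4
```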

Nehe12 commented 11 months ago

I have this error in my code; here is the complete output (the WinSock message was printed in Spanish, translated here):

```
torchrun --nproc_per_node 1 example_chat_completion.py ./llama-2-13b ./tokenizer.model max_seq_len 512 max_batch_size
[2023-11-08 12:23:44,484] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
[W socket.cpp:663] [c10d] The client socket has failed to connect to [CC]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W socket.cpp:663] [c10d] The client socket has failed to connect to [CC]:29500 (system error: 10049 - The requested address is not valid in its context.).

initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
Traceback (most recent call last):
  File "C:\Users\CC\Documents\INTELIGENCIA_ARTIFICIAL\prueba\llama\example_chat_completion.py", line 104, in <module>
    fire.Fire(main)
  File "C:\Python311\Lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\CC\Documents\INTELIGENCIA_ARTIFICIAL\prueba\llama\example_chat_completion.py", line 35, in main
    generator = Llama.build(
                ^^^^^^^^^^^^
  File "C:\Users\CC\Documents\INTELIGENCIA_ARTIFICIAL\prueba\llama\llama\generation.py", line 92, in build
    torch.cuda.set_device(local_rank)
  File "C:\Python311\Lib\site-packages\torch\cuda\__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
    ^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'
[2023-11-08 12:23:49,537] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 22516) of binary: C:\Python311\python.exe
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Python311\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "C:\Python311\Lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "C:\Python311\Lib\site-packages\torch\distributed\run.py", line 806, in main
    run(args)
  File "C:\Python311\Lib\site-packages\torch\distributed\run.py", line 797, in run
    elastic_launch(
  File "C:\Python311\Lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\torch\distributed\launcher\api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

example_chat_completion.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2023-11-08_12:23:49
  host       : CC
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 22516)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
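The root cause in this log is different from the original post: `AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'` typically means the installed PyTorch wheel was built without CUDA support, so `Llama.build` fails the moment it tries to select a GPU. A small diagnostic sketch (the helper name `cuda_build_status` is hypothetical, not part of the repo):

```python
import importlib.util

def cuda_build_status() -> str:
    """Classify the local PyTorch install (hypothetical diagnostic helper)."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch
    if torch.version.cuda is None:      # wheel compiled without CUDA
        return "cpu-only build"
    if not torch.cuda.is_available():   # CUDA wheel, but no usable GPU/driver
        return "cuda build, no usable device"
    return "cuda ok"

print(cuda_build_status())
```

If it reports a CPU-only build, the usual fix is reinstalling a CUDA-enabled wheel using the install selector on pytorch.org rather than a plain `pip install torch`.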
chirag-periwal commented 11 months ago

Were you able to resolve the issue, @Nehe12? I am getting the same error.