meta-llama / llama

Inference code for Llama models

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 16079) #830

Closed 13555957595 closed 8 months ago

13555957595 commented 11 months ago

Hi, I tried to deploy Llama 2 today and ran into this issue:

(llama) [root@iZbp1iobggdz6jrvvlgpx4Z llama]# torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir llama-2-7b/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 128 --max_batch_size 6

initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
Loaded in 12.05 seconds
Traceback (most recent call last):
  File "/home/llm/llama/example_text_completion.py", line 69, in <module>
    fire.Fire(main)
  File "/home/llm/conda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/llm/conda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/llm/conda3/envs/llama/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/llm/llama/example_text_completion.py", line 56, in main
    results = generator.text_completion(
  File "/home/llm/llama/llama/generation.py", line 265, in text_completion
    generation_tokens, generation_logprobs = self.generate(
  File "/home/llm/conda3/envs/llama/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/llm/llama/llama/generation.py", line 165, in generate
    total_len = min(params.max_seq_len, max_gen_len + max_prompt_len)
TypeError: can only concatenate str (not "int") to str
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 16079) of binary: /home/llm/conda3/envs/llama/bin/python
Traceback (most recent call last):
  File "/home/llm/conda3/envs/llama/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/llm/conda3/envs/llama/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/llm/conda3/envs/llama/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/llm/conda3/envs/llama/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/llm/conda3/envs/llama/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/llm/conda3/envs/llama/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

example_text_completion.py FAILED

Failures:
  <NO_OTHER_FAILURES>

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-28_22:55:34
  host      : iZbp1iobggdz6jrvvlgpx4Z
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 16079)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Please help me resolve the issue

Thank you

Suenym commented 11 months ago

Same error

13555957595 commented 11 months ago

It sometimes runs successfully; I guess I need a GPU with more memory.

:)

WuhanMonkey commented 11 months ago

Hi @13555957595, it would be helpful if you could share more about your system specs, especially how much GPU memory you have. Here is a calculator you can use to estimate memory requirements: https://huggingface.co/spaces/hf-accelerate/model-memory-usage
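
As a rough back-of-envelope (an assumption on my part: fp16 weights, ignoring the KV cache, activations, and CUDA overhead), the 7B weights alone take about 13 GiB:

    # Back-of-envelope VRAM estimate for Llama-2-7B.
    # Assumptions: fp16 weights (2 bytes per parameter); KV cache, activations,
    # and CUDA overhead are ignored, so real usage will be noticeably higher.
    n_params = 7_000_000_000
    bytes_per_param = 2
    print(f"~{n_params * bytes_per_param / 2**30:.1f} GiB for the weights alone")  # ~13.0 GiB

So a 16 GB card is roughly the practical floor for unquantized fp16 inference with this model.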

anhbsn commented 11 months ago

Same error

Hi bro, have you fixed it yet?

Matheart commented 11 months ago

Same error here

Matheart commented 11 months ago

I seem to have fixed the issue. The main cause is that fire.Fire(main) does not keep the default values of the parameters, which leaves some of them as "" (type str). The fix is to add --temperature 0.6 --top_p 0.9 --max_gen_len 64 at the end of your command.
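
For anyone wondering why those flags matter, here is a minimal sketch of the failure mode (the variable values are made up for illustration; this is not the repo's exact code):

    # Minimal sketch: if a CLI flag reaches main() as an empty string instead
    # of its numeric default, the length arithmetic in generate() raises the
    # exact TypeError from the traceback above. Values are illustrative.
    max_seq_len = 128
    max_prompt_len = 28     # hypothetical tokenized prompt length
    max_gen_len = ""        # what arrived instead of an int

    try:
        total_len = min(max_seq_len, max_gen_len + max_prompt_len)
    except TypeError as e:
        print(e)  # can only concatenate str (not "int") to str

    # Passing --max_gen_len 64 explicitly (or coercing the value yourself)
    # avoids the crash:
    max_gen_len = int(max_gen_len or 64)
    print(min(max_seq_len, max_gen_len + max_prompt_len))  # 92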

anhbsn commented 11 months ago

Hi bro, is it done? I added --temperature 0.6 --top_p 0.9 --max_gen_len 64 at the end of my command and got this message:

Loaded in 115.20 seconds
tokens generated: 64, tokens/sec: 1.60

PyTorch is a deep learning framework based on the Python programming language, which allows developers to create and train neural networks for a variety of applications. It is widely used in industries such as computer vision, natural language processing, and robotics. One of the key features of PyTorch is its ability to perform dynamic computation graph

Then it exits back to the shell prompt.

anhbsn commented 11 months ago

Oh yeah, it worked!

a123aaa commented 11 months ago

Hi bro, help me. I used torchrun --nproc_per_node 1 example_text_completion.py --ckpt_dir 7B/ --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 6 --temperature 0.6 --top_p 0.9 --max_gen_len 64, but it still doesn't work.

anhbsn commented 11 months ago

Hi, I think you used the wrong path: it should be --ckpt_dir llama-2-7b, not --ckpt_dir 7B. And if the error continues, try changing torchrun to python3 -m torch.distributed.run.

This is my command line: python3 -m torch.distributed.run --nproc_per_node 1 example_text_completion.py --ckpt_dir llama-2-7b --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 4 --temperature 0.6 --top_p 0.9 --max_gen_len 64

a123aaa commented 11 months ago

Here is the result it returns:

python3 -m torch.distributed.run --nproc_per_node 1 example_text_completion.py --ckpt_dir llama-2-7b --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 4 --temperature 0.6 --top_p 0.9 --max_gen_len 64

initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
Loaded in 7.85 seconds

I believe the meaning of life is to find happiness and fulfillment. Here are some reasons why:

  1. Happiness is a fundamental human need: As humans, we have a natural desire for happiness and satisfaction. It is a basic human need, along with food, shelter, and clothing.
  2. Happiness is a

==================================

Simply put, the theory of relativity states that

1) the laws of physics are the same for all observers in uniform motion relative to one another, and 2) the speed of light is always constant, regardless of the motion of the observer. Einstein's theory of relativity revolutionized our understanding of space and time. Here are some key

==================================

A brief message congratulating the team on the launch:

    Hi everyone,

    I just 

wanted to take a moment to congratulate the team on the successful launch of our new product! This has been a huge effort and I'm thrilled to see it come to life. Your hard work and dedication have paid off and I'm so proud of each and every one of you. Let

==================================

Translate English to French:

    sea otter => loutre de mer
    peppermint => menthe poivrée
    plush girafe => girafe peluche
    cheese =>

fromage lemon => citron lion => lion elephant => éléphant candy => bonbon kangaroo => kangourou sunflower => tulipe cat => chat dog => chien

==================================

a123aaa commented 11 months ago

But I didn't ask it those questions.

anhbsn commented 11 months ago

@a123aaa In example_text_completion.py there is a prompts list:

Do you see the # signs in prompts? Those mark the example prompts the FacebookResearch team commented out; "PyTorch is" is the prompt that is currently uncommented.

To change the question, you just need to change the prompt there. Because you and I are both using llama-2-7b rather than llama-2-7b-chat, we cannot ask questions and receive responses directly from the command line.
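
For reference, the list looks roughly like this; it is reconstructed from the outputs quoted earlier in this thread, so check your local copy of example_text_completion.py for the exact contents:

    # Rough reconstruction of the prompts list in example_text_completion.py,
    # based on the outputs quoted in this thread (not the file's exact text).
    prompts = [
        "I believe the meaning of life is",
        "Simply put, the theory of relativity states that ",
        """A brief message congratulating the team on the launch:

        Hi everyone,

        I just """,
        # Few-shot prompt: give the model a few examples, then let it continue.
        """Translate English to French:

        sea otter => loutre de mer
        peppermint => menthe poivrée
        plush girafe => girafe peluche
        cheese =>""",
        # "PyTorch is",  # swap in your own prompt like this
    ]

Edit or replace any entry and rerun the example to change what the model completes.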

a123aaa commented 11 months ago

python3 -m torch.distributed.run --nproc_per_node 1 example_text_completion.py --ckpt_dir llama-2-7b --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 4 --temperature 0.6 --top_p 0.9 --max_gen_len 64

Thank you very much bro. Actually I'm using llama-2-7b-chat. How can I solve the following problem? It has been bothering me for two days!

python3 -m torch.distributed.run --nproc_per_node 1 example_chat_completion.py --ckpt_dir llama-2-7b-chat --tokenizer_path tokenizer.model --max_seq_len 512 --max_batch_size 4 --temperature 0.6 --top_p 0.9 --max_gen_len 64

initializing model parallel with size 1
initializing ddp with size 1
initializing pipeline with size 1
Loaded in 7.92 seconds
Traceback (most recent call last):
  File "/home/one/AIEmployee/Llama/llama/example_chat_completion.py", line 104, in <module>
    fire.Fire(main)
  File "/home/one/miniconda3/envs/Llama2/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/one/miniconda3/envs/Llama2/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/one/miniconda3/envs/Llama2/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/one/AIEmployee/Llama/llama/example_chat_completion.py", line 87, in main
    results = generator.chat_completion(
  File "/home/one/AIEmployee/Llama/llama/llama/generation.py", line 364, in chat_completion
    generation_tokens, generation_logprobs = self.generate(
  File "/home/one/miniconda3/envs/Llama2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/one/AIEmployee/Llama/llama/llama/generation.py", line 160, in generate
    assert bsz <= params.max_batch_size, (bsz, params.max_batch_size)
AssertionError: (6, 4)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 379983) of binary: /home/one/miniconda3/envs/Llama2/bin/python3
Traceback (most recent call last):
  File "/home/one/miniconda3/envs/Llama2/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/one/miniconda3/envs/Llama2/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/one/miniconda3/envs/Llama2/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/home/one/miniconda3/envs/Llama2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/one/miniconda3/envs/Llama2/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/one/miniconda3/envs/Llama2/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/one/miniconda3/envs/Llama2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/one/miniconda3/envs/Llama2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

example_chat_completion.py FAILED

Failures:
  <NO_OTHER_FAILURES>

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-10-13_16:14:02
  host      : alpha
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 379983)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

anhbsn commented 11 months ago

@a123aaa Oh so you use llama-2-7B-chat, I thought you used llama-2-7B.

But above I saw that you already ran it and got a response on the command line. Why do you get this error when you run it again?

And I read the FacebookResearch documentation and saw that their instructions use --max_seq_len 512 --max_batch_size 6. Please try again!

Besides, I'm using a computer without a GPU and have only run this a few times, so I really don't know the cause of this error, even though I've encountered it before!
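
A note on the AssertionError: (6, 4) in the traceback above: it comes from the batch-size check in llama/generation.py. example_chat_completion.py submits all of its example dialogs as one batch, so six dialogs fail against --max_batch_size 4. A paraphrased sketch of the check (not the file's exact code):

    # Paraphrase of the assertion at llama/generation.py line 160 (see the
    # traceback above): the number of prompts/dialogs in a batch must not
    # exceed the max_batch_size the model was built with.
    def check_batch(num_dialogs: int, max_batch_size: int) -> None:
        assert num_dialogs <= max_batch_size, (num_dialogs, max_batch_size)

    try:
        check_batch(6, 4)  # six example dialogs vs. --max_batch_size 4
    except AssertionError as e:
        print(e)  # (6, 4)

So passing --max_batch_size 6, as suggested above, or trimming the dialogs list to at most four entries should clear it.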

a123aaa commented 11 months ago

Thank you very much bro, I'll look for another way.

Vatsal1106Virani commented 10 months ago

This isn't working for the example_chat_completion.py file. What is the solution for that?

ymx10086 commented 10 months ago

It seems I've hit the same problem; hoping for a solution.

WuhanMonkey commented 8 months ago

The problem in this thread is likely due to limited VRAM for model inference. Reducing max_gen_len or max_batch_size should resolve the issue. Closing.

tinaty commented 6 months ago

Hi @anhbsn, I used your command line but I still get torch.distributed.elastic.multiprocessing.errors.ChildFailedError. I'm using a machine without a GPU (a Mac).

GG6Bond commented 5 months ago

Thank you! Successfully solved it with your command.