nico1995lee opened 5 months ago
Hi, I tried fixing this error. Could you try again? Thanks.
Hi, thanks for your reply. That error has been resolved, but a new one appeared:
File "/mlsteam/data/LLM/llm-reasoners/reasoners/lm/llama_2_model.py", line 146, in generate
assert max_prompt_size <= params.max_seq_len, f"prompt length exceeds limit: {max_prompt_size} > {params.max_seq_len}"
AssertionError: prompt length exceeds limit: 2054 > 2048
[2024-04-15 04:49:16,875] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2659) of binary: /mlsteam/data/LLM/llama/venv/bin/python
Traceback (most recent call last):
File "/mlsteam/data/LLM/llama/venv/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/mlsteam/data/LLM/llama/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/mlsteam/data/LLM/llama/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/mlsteam/data/LLM/llama/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/mlsteam/data/LLM/llama/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/mlsteam/data/LLM/llama/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
examples/rap_gsm8k/inference.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-04-15_04:49:16
host : 8dede9e2fb55
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2659)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Could you specify the script you were running, so that I can reproduce the error?
I'm trying to implement RAP in the gsm8k dataset, so I executed the following command:
torchrun --nproc-per-node 1 --master-port 6676 examples/rap_gsm8k/inference.py --base_lm llama-2 --llama_2_ckpts /mlsteam/data/LLM/llama/ --llama_size 7B
Hi, I tried running this command but couldn't reproduce the error... It seems to come from improper handling of an edge case, likely a prompt that slightly exceeds the 2048-token context limit. Printing the input and output may make it easier to debug.
Besides, I noticed that you are using Llama-2 7B, which is a relatively weak model and may not follow the demonstration format; this can also cause unexpected errors. We now support Llama-3, so you may try whether a stronger model resolves the problem.
Thanks!
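For anyone hitting the same assertion: one common workaround is to truncate the tokenized prompt so it fits the context window before calling generate. The sketch below is only illustrative; the names fit_prompt, max_seq_len, and max_gen_len are assumptions for this example, not the repository's actual API.

```python
# Hedged sketch: trim a tokenized prompt so that prompt tokens plus the
# generation budget stay within the model's context window.
# `fit_prompt`, `max_seq_len`, and `max_gen_len` are hypothetical names.

def fit_prompt(tokens, max_seq_len, max_gen_len):
    """Keep at most max_seq_len - max_gen_len prompt tokens,
    dropping the oldest tokens first so recent context survives."""
    budget = max_seq_len - max_gen_len
    if len(tokens) > budget:
        tokens = tokens[-budget:]  # keep the tail (most recent context)
    return tokens

# Example mirroring the log above: a 2054-token prompt against a
# 2048-token context, reserving 256 tokens for generation.
tokens = list(range(2054))
trimmed = fit_prompt(tokens, max_seq_len=2048, max_gen_len=256)
print(len(trimmed))  # 1792 tokens, i.e. 2048 - 256
```

Note that truncating from the head can cut off part of the few-shot demonstration, so shortening the prompt itself (fewer in-context examples) is usually the safer fix.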