Hi Brainth, this is probably due to an inappropriate set-up of the LM service. Did you successfully run the code in Quickstart (changing the variable names and running Start LM & RM Services)?
I first ran `sh reason/llm_service/create_service_math_shepherd.sh`, then changed the model paths corresponding to (--LM, --RM), and then ran `sh scripts/eval/cot_greedy.sh`, which failed with: requests.exceptions.MissingSchema: Invalid URL '/worker_generate': No scheme supplied. Perhaps you meant https:///worker_generate?
After reading your reply, I checked the running services, and it looks like one service is missing. After running `sh reason/llm_service/create_service_math_shepherd.sh`, shouldn't both of these two services be running?
Then I killed that service and tried to rerun `sh reason/llm_service/create_service_math_shepherd.sh`, but after rerunning it, no service started, and the one I had killed was gone as well.
Rerunning the script `sh scripts/eval/cot_greedy.sh` gave a new error: requests.exceptions.ConnectionError: HTTPConnectionPool(host='0.0.0.0', port=28777): Max retries exceeded with url: /get_worker_address (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcf0eb172b0>: Failed to establish a new connection: [Errno 111] Connection refused'))
This error is different from the original one; I think it's because neither service is running now, whereas the first error was presumably caused by one missing service. So the question has become: why does neither service start when I rerun `sh reason/llm_service/create_service_math_shepherd.sh`?
Hi Brainth,
You are right about the missing-service error. What create_service_math_shepherd.sh does is set up two LLM services, for generation and value inference respectively. The first service runs reason.llm_service.workers.vllm_worker with $MODEL_PATH, and the second runs reason.llm_service.workers.reward_model_worker with $VALUE_MODEL_PATH. By default, we run these two services on separate hardware ($NUM_LM_WORKER=1, $NUM_RM_WORKER=1).
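For reference, here is a minimal sketch of what such a launch script could look like, pieced together from the worker commands visible in the `ps` output below; the tmux session layout, variable names, and the controller launch are assumptions, not the actual script:

```bash
# Hypothetical sketch of create_service_math_shepherd.sh: ports, module names and
# the FastChat session name come from this thread; everything else is assumed.
MODEL_PATH=/path/to/mistral-7b-sft                       # generation (LM) model
VALUE_MODEL_PATH=/path/to/math-shepherd-mistral-7b-prm   # value (RM) model
CONTROLLER_ADDR=http://0.0.0.0:28777

tmux new-session -d -s FastChat
# FastChat controller that both workers register with; "Connection refused" on
# port 28777 in the error above means this process is not running:
tmux send-keys -t FastChat "python3 -m fastchat.serve.controller --host 0.0.0.0 --port 28777" Enter

# LM worker served by vLLM:
tmux new-window -t FastChat
tmux send-keys -t FastChat "python3 -m reason.llm_service.workers.vllm_worker --model-path $MODEL_PATH --controller-address $CONTROLLER_ADDR --host 0.0.0.0 --port 30010 --worker-address http://0.0.0.0:30010 --dtype bfloat16 --swap-space 32" Enter

# RM worker:
tmux new-window -t FastChat
tmux send-keys -t FastChat "python3 -m reason.llm_service.workers.reward_model_worker --model-path $VALUE_MODEL_PATH --controller-address $CONTROLLER_ADDR --host 0.0.0.0 --port 30011 --worker-address http://0.0.0.0:30011" Enter
```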
Running `ps -ef | grep open`, you are likely to see two services running:
```
anaconda3/envs/openr/bin/python3 -m reason.llm_service.workers.reward_model_worker --model-path /mnt/nasdata/xxx/llms/huggingface/math-shepherd-mistral-7b-prm --controller-address http://0.0.0.0:28777 --host 0.0.0.0 --port 30011 --worker-address http://0.0.0.0:30011
anaconda3/envs/openr/bin/python3 -m reason.llm_service.workers.vllm_worker --model-path /mnt/nasdata/xxx/llms/huggingface/mistral-7b-sft --controller-address http://0.0.0.0:28777 --host 0.0.0.0 --port 30010 --worker-address http://0.0.0.0:30010 --dtype bfloat16 --swap-space 32
```
For your case, you need to kill these two services and the FastChat tmux session to be able to rerun the services.
In the latest commit I added a checker here: if the model_name is not registered in the controller, it will raise an exception. You can pull it.
As for how to kill the service process, just use `tmux kill-session -t {Your Session Name}`. I'll add this command to the README later.
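Putting the tmux commands from this thread together, a typical inspect-and-restart cycle looks like this (session name FastChat as above):

```bash
tmux ls                          # list sessions; find the one hosting the services
tmux a -t FastChat               # attach to read worker errors; detach with Ctrl-b d
tmux kill-session -t FastChat    # kill it so the launch script can be rerun cleanly
ps -ef | grep open               # confirm no stale worker processes are left behind
sh reason/llm_service/create_service_math_shepherd.sh   # relaunch both services
```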
I set $NUM_LM_WORKER=1 and $NUM_RM_WORKER=1 and ran create_service_math_shepherd.sh on my machine. The reward_model_worker starts fine, but the vllm_worker does not start. Is there a way to track down the cause?
Hello!
Can you provide the number and type of GPUs, as well as the error message from vLLM?
But I don't see any error message from vLLM. GPU 0 is currently running the reward_model_worker.
I ran `tmux attach-session -t FastChat1` and suddenly saw the error message: No module named 'vllm'. It is running successfully now.
Thanks!
> No module named 'vllm'

I think you should first make sure you have successfully installed vllm.
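A quick sanity check, run in the same environment the workers use (the openr conda env visible in the `ps` output above):

```bash
# Verify vllm is importable and print the installed version:
python3 -c "import vllm; print(vllm.__version__)"
# If the import fails with "No module named 'vllm'", install it into this env
# (pin the version your repo's requirements expect; the bare install is a sketch):
pip install vllm
```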
Once I ran create_service_math_shepherd.sh, it seems both services are running? Am I correct? Below is a screenshot of `ps -ef | grep open`:
But very shortly afterwards, the LLM service seems to close itself, and I got the same error: "requests.exceptions.MissingSchema: Invalid URL '/worker_generate': No scheme supplied. Perhaps you meant https:///worker_generate?"
Then I checked my running services again and found that only the RM service seems to be running. I am not sure why the other service closed itself:
I ran into the same problem.
Hi, maybe you can check the output logs, or just attach to the tmux session with `tmux a -t FastChat`.
Solution:
1. Learn some tmux basics for debugging. The LLM service exits suddenly because it hit an error; attach to the tmux session to see the error message.
2. Two changes I made on my side (see the sketch below):
   a. Remove the proxy: `tmux send-keys "unset http_proxy" Enter` and `tmux send-keys "unset https_proxy" Enter`
   b. Based on the error message, comment out `use_beam_search=use_beam_search` at vllm_worker.py:104
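As a concrete sketch of those two changes (the sed edit assumes line 104 of vllm_worker.py still holds the use_beam_search argument in your checkout; newer vLLM releases dropped use_beam_search from SamplingParams, which is a likely reason the worker crashes on it):

```bash
# Inside the tmux session (or in the launch script), drop proxies that would
# otherwise intercept requests to the local controller and workers:
unset http_proxy https_proxy
# Comment out the offending argument; -i.bak keeps a backup of the original file:
sed -i.bak '104s/^/# /' reason/llm_service/workers/vllm_worker.py
```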
Thanks, that helps a lot!
I later found in the log that the error occurs when the two services are launched, but the log does not state the specific cause:
Do you mean the ERROR here? It's not an actual error, just output from log.error(). Can you please check the real error message by attaching to the tmux session?
Following the method above, I successfully ran create_service_qwen2.5_math_hf.sh, and the tmux logs show no errors, but running `python reason/evaluation/evaluate.py` still fails with: Invalid URL '/worker_generate': No scheme supplied. Perhaps you meant https:///worker_generate? I traced the error to text_generation.py: the printed worker_addr is empty, so the URL is missing its http:// prefix, which triggers the error. How can I fix this?
Hi @xiaoweiweixiao, have your problems been solved now? What does your create_llm_service script look like? What are your --LM and --RM during evaluation? Or can you provide more information by attaching to the tmux session and checking the error message?
Thanks, it's solved now. The --LM and --RM arguments must match the model name in the log file; you only need to give the model name, not the full path.
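For anyone hitting the same error: the controller returns an empty worker address when the requested model name is not registered, which is exactly what turns the request URL into a bare '/worker_generate'. Since these are FastChat-style services, you can check the registered names directly (controller address as in this thread; the model name below is just an example, use your own):

```bash
# List the model names the workers registered with the controller:
curl -X POST http://0.0.0.0:28777/list_models
# Look up the worker address for the exact name you pass as --LM / --RM;
# an empty "address" field in the reply reproduces the MissingSchema error:
curl -X POST http://0.0.0.0:28777/get_worker_address \
     -H "Content-Type: application/json" \
     -d '{"model": "mistral-7b-sft"}'
```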
System Info
linux
Who can help?
No response
Reproduction
When I run `sh scripts/eval/cot_greedy.sh`, I get the following error:

```
  File "/home/whg/openr-main/reason/evaluation/evaluate.py", line 195, in <module>
    parallel_evaluate_test_dataset(config.method, solver_fn, save_dir)
  File "/home/whg/openr-main/reason/evaluation/evaluate.py", line 129, in parallel_evaluate_test_dataset
    for i, (problem_inst, result, output) in enumerate(
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/ray/util/actor_pool.py", line 170, in get_generator
    yield self.get_next_unordered()
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/ray/util/actor_pool.py", line 370, in get_next_unordered
    return ray.get(future)
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/ray/_private/worker.py", line 2691, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/ray/_private/worker.py", line 871, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(MissingSchema): ray::RemoteMathEvaluator.evaluate_problem() (pid=29384, ip=10.44.115.98, actor_id=7cdae62f4df49eaf1376b46601000000, repr=<reason.evaluation.evaluator.RemoteMathEvaluator object at 0x7eef295bfe20>)
  File "/home/whg/openr-main/reason/evaluation/evaluator.py", line 116, in evaluate_problem
    solution: SolutionOutput = solver_fn(problem_inst, self.lm_call, self.rm_call)
  File "/home/whg/openr-main/reason/evaluation/methods.py", line 33, in cot
    return best_of_n(config, gen_config, problem_inst, llm_call, rm_call)
  File "/home/whg/openr-main/reason/evaluation/methods.py", line 54, in best_of_n
    output = lm_call(prompt, gen_config)
  File "/home/whg/openr-main/reason/inference/lm_call.py", line 28, in __call__
    return _generate_fastchat(
  File "/home/whg/openr-main/reason/inference/text_generation.py", line 53, in _generate_fastchat
    response = requests.post(
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/requests/sessions.py", line 575, in request
    prep = self.prepare_request(req)
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/requests/sessions.py", line 484, in prepare_request
    p.prepare(
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/requests/models.py", line 367, in prepare
    self.prepare_url(url, params)
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/requests/models.py", line 438, in prepare_url
    raise MissingSchema(
requests.exceptions.MissingSchema: Invalid URL '/worker_generate': No scheme supplied. Perhaps you meant https:///worker_generate?
```
Expected behavior
The script is expected to run normally.