openreasoner / openr

OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models
https://openreasoner.github.io/
MIT License

When I run sh scripts/eval/cot_greedy.sh, I get the error requests.exceptions.MissingSchema: Invalid URL '/worker_generate': No scheme supplied. Perhaps you meant https:///worker_generate? #18

Open Brainth opened 4 days ago

Brainth commented 4 days ago

System Info

Linux

Who can help?

No response

Information

Tasks

Reproduction

When I run sh scripts/eval/cot_greedy.sh, I get this error:

File "/home/whg/openr-main/reason/evaluation/evaluate.py", line 195, in <module>
    parallel_evaluate_test_dataset(config.method, solver_fn, save_dir)
File "/home/whg/openr-main/reason/evaluation/evaluate.py", line 129, in parallel_evaluate_test_dataset
    for i, (problem_inst, result, output) in enumerate(
File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/ray/util/actor_pool.py", line 170, in get_generator
    yield self.get_next_unordered()
File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/ray/util/actor_pool.py", line 370, in get_next_unordered
    return ray.get(future)
File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/ray/_private/worker.py", line 2691, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/ray/_private/worker.py", line 871, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(MissingSchema): ray::RemoteMathEvaluator.evaluate_problem() (pid=29384, ip=10.44.115.98, actor_id=7cdae62f4df49eaf1376b46601000000, repr=<reason.evaluation.evaluator.RemoteMathEvaluator object at 0x7eef295bfe20>)
File "/home/whg/openr-main/reason/evaluation/evaluator.py", line 116, in evaluate_problem
    solution: SolutionOutput = solver_fn(problem_inst, self.lm_call, self.rm_call)
File "/home/whg/openr-main/reason/evaluation/methods.py", line 33, in cot
    return best_of_n(config, gen_config, problem_inst, llm_call, rm_call)
File "/home/whg/openr-main/reason/evaluation/methods.py", line 54, in best_of_n
    output = lm_call(prompt, gen_config)
File "/home/whg/openr-main/reason/inference/lm_call.py", line 28, in __call__
    return _generate_fastchat(
File "/home/whg/openr-main/reason/inference/text_generation.py", line 53, in _generate_fastchat
    response = requests.post(
File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/requests/sessions.py", line 575, in request
    prep = self.prepare_request(req)
File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/requests/sessions.py", line 484, in prepare_request
    p.prepare(
File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/requests/models.py", line 367, in prepare
    self.prepare_url(url, params)
File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/requests/models.py", line 438, in prepare_url
    raise MissingSchema(
requests.exceptions.MissingSchema: Invalid URL '/worker_generate': No scheme supplied. Perhaps you meant https:///worker_generate?
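The exception fires before any network I/O: requests validates the URL while preparing the request, and a bare path like '/worker_generate' has no scheme, which is what happens when the worker address prepended to the endpoint is empty. A minimal sketch of the failure (the empty worker_addr is an assumption about what the controller returned):

import requests

worker_addr = ""  # assumed: the controller returned no address for the model
try:
    # URL validation happens in prepare_url(), before any connection is made
    requests.post(worker_addr + "/worker_generate", json={"prompt": "1+1="})
except requests.exceptions.MissingSchema as e:
    print(e)  # Invalid URL '/worker_generate': No scheme supplied. ...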

Expected behavior

Expected it to run normally.

YanSong97 commented 4 days ago

Hi Brainth, this is probably due to an incorrect setup of the LM service. Did you successfully run the code in the Quickstart (changing the variable names and running Start LM & RM Services)?

Brainth commented 4 days ago

I first ran "sh reason/llm_service/create_service_math_shepherd.sh", then modified the model paths for (--LM, --RM), and then ran sh scripts/eval/cot_greedy.sh, which produced the error requests.exceptions.MissingSchema: Invalid URL '/worker_generate': No scheme supplied. Perhaps you meant https:///worker_generate?

After reading your reply, I checked the running services: (screenshot) It looks like one service is missing. After running "sh reason/llm_service/create_service_math_shepherd.sh", shouldn't both of these services be running here?

Then I killed this service and tried to rerun "sh reason/llm_service/create_service_math_shepherd.sh": (screenshot) But I found that after rerunning, no services started, and the one I had killed was gone as well.

Rerunning the script sh scripts/eval/cot_greedy.sh gave a new error: requests.exceptions.ConnectionError: HTTPConnectionPool(host='0.0.0.0', port=28777): Max retries exceeded with url: /get_worker_address (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcf0eb172b0>: Failed to establish a new connection: [Errno 111] Connection refused'))

This error is different from the original one; I think it's because neither service is running now. As for the initial error, my guess is that one service was missing.

So the question now is: when I rerun "sh reason/llm_service/create_service_math_shepherd.sh", why does neither service start?
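Per the two tracebacks, _generate_fastchat first asks the controller on port 28777 for the address of a worker serving the model, then POSTs the prompt to that worker. So a Connection refused on /get_worker_address means the controller itself is not running, while an empty worker address (the earlier MissingSchema error) means the controller is up but no worker has registered. A rough sketch of that lookup, with the model name assumed:

import requests

controller_addr = "http://0.0.0.0:28777"  # from the ConnectionError above
ret = requests.post(
    controller_addr + "/get_worker_address",
    json={"model": "math-shepherd-mistral-7b-prm"},  # assumed model name
)
worker_addr = ret.json()["address"]  # "" when no worker serves this model
print(worker_addr or "empty address: no worker registered for this model")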

YanSong97 commented 4 days ago

Hi Brainth,

You are right about the missing-service error. What create_service_math_shepherd.sh does is set up two LLM services, one for generation and one for value inference. The first service runs reason.llm_service.workers.vllm_worker with $MODEL_PATH, and the second runs reason.llm_service.workers.reward_model_worker with $VALUE_MODEL_PATH. By default, we run these two services on separate hardware ($NUM_LM_WORKER=1, $NUM_RM_WORKER=1).

Running ps -ef | grep open, you are likely to see two services running:

anaconda3/envs/openr/bin/python3 -m reason.llm_service.workers.reward_model_worker --model-path /mnt/nasdata/xxx/llms/huggingface/math-shepherd-mistral-7b-prm --controller-address http://0.0.0.0:28777 --host 0.0.0.0 --port 30011 --worker-address http://0.0.0.0:30011
anaconda3/envs/openr/bin/python3 -m reason.llm_service.workers.vllm_worker --model-path /mnt/nasdata/xxx/llms/huggingface/mistral-7b-sft --controller-address http://0.0.0.0:28777 --host 0.0.0.0 --port 30010 --worker-address http://0.0.0.0:30010 --dtype bfloat16 --swap-space 32

For your case, you need to kill these two services as well as the FastChat tmux session before you can rerun the services.

ziyuwan commented 1 day ago

In the latest commit, I added a checker here: if the model_name is not registered with the controller, it will raise an Exception. You can pull the latest code.
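A sketch of what such a check can look like (the function name and message here are hypothetical, not the actual commit):

import requests

def get_worker_address(controller_addr: str, model_name: str) -> str:
    # Hypothetical sketch: fail fast when the controller has no worker for the model
    ret = requests.post(controller_addr + "/get_worker_address", json={"model": model_name})
    worker_addr = ret.json()["address"]
    if not worker_addr:
        raise ValueError(
            f"Model '{model_name}' is not registered with the controller; "
            "check that the LM/RM services started successfully."
        )
    return worker_addr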

And for how to kill the service process, just use tmux kill-session -t {Your Session Name}. I'll add this command to the README later.

Brainth commented 1 day ago

I set $NUM_LM_WORKER=1 and $NUM_RM_WORKER=1 and ran create_service_math_shepherd.sh on my machine. reward_model_worker starts normally, but vllm_worker fails to start. Is there a way to track this down?

ziyuwan commented 1 day ago

Hello! Can you provide the number and type of your GPUs, as well as the error message from vLLM?

Brainth commented 1 day ago

(screenshot) As for vLLM, though, I didn't see any error message. GPU 0 is currently running reward_model_worker.

I ran tmux attach-session -t FastChat1 and suddenly saw the error message: No module named 'vllm'. It is now running successfully. (screenshot)

Thanks.

ziyuwan commented 1 day ago

No module named ‘vllm’

I think you should first make sure you have successfully installed vllm.
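For instance, a quick check run inside the same conda environment the worker uses:

import vllm  # raises ModuleNotFoundError if vllm is missing from this environment
print(vllm.__version__)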

jeffyeylw commented 1 day ago

Once I ran create_service_math_shepherd.sh, it seemed both services were running? Am I correct? Below is a screenshot of ps -ef | grep open:

(screenshot)

But very shortly afterwards, the LLM service seems to shut itself down, and I got the same error: "requests.exceptions.MissingSchema: Invalid URL '/worker_generate': No scheme supplied. Perhaps you meant https:///worker_generate?"

Then I checked my running services again and found that only the RM service was running. I am not sure why the other service shut itself down:

(screenshot)

Brainth commented 19 hours ago

I ran into the same problem.

ziyuwan commented 19 hours ago

Hi, maybe you can check the output logs, or just attach to the tmux session by

tmux a -t FastChat

Brainth commented 18 hours ago

Solution:
1. Learn some tmux basics for troubleshooting. The LLM service exits abruptly because it hit an error; attach to the tmux session to see the error message.
2. Two changes I made on my side:
a. Remove the proxy: tmux send-keys "unset http_proxy" Enter and tmux send-keys "unset https_proxy" Enter
b. Based on the error message, comment out use_beam_search=use_beam_search at vllm_worker.py:104
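The use_beam_search edit is most likely a vLLM version mismatch: recent vLLM releases dropped use_beam_search from SamplingParams, so passing it fails at startup. A sketch of the edited call (assumed shape; only the commented-out argument comes from the report above):

from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=0.0,  # illustrative values, not the repo's defaults
    top_p=1.0,
    max_tokens=256,
    # use_beam_search=use_beam_search,  # rejected by newer vLLM versions
)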

jeffyeylw commented 8 hours ago

Solution:
1. Learn some tmux basics for troubleshooting. The LLM service exits abruptly because it hit an error; attach to the tmux session to see the error message.
2. Two changes I made on my side:
a. Remove the proxy: tmux send-keys "unset http_proxy" Enter and tmux send-keys "unset https_proxy" Enter
b. Based on the error message, comment out use_beam_search=use_beam_search at vllm_worker.py:104

Thanks, that helps a lot!

I later found in the log that the failure happens when the two services are launched, but the log does not state the specific cause of the error: (screenshot)

ziyuwan commented 7 hours ago

Do you mean the ERROR here? It's not an error; it's just output from log.error(). Can you please check the actual error message by attaching to the tmux session?
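For illustration, logging's error() only records a message at ERROR severity; it neither raises nor stops the process, so an ERROR line in the log is not by itself evidence of a crash:

import logging

logging.basicConfig()
logging.error("heart beat error?")  # writes the message and returns
print("process continues normally")  # still reached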