Hi Brainth, this is probably due to an inappropriate set-up of the LM service. Did you successfully run the code in Quickstart (changing the variable names and running Start LM & RM Services)?
I first ran `sh reason/llm_service/create_service_math_shepherd.sh`, then changed the model paths corresponding to (--LM, --RM), and then ran `sh scripts/eval/cot_greedy.sh`, which failed with: requests.exceptions.MissingSchema: Invalid URL '/worker_generate': No scheme supplied. Perhaps you meant https:///worker_generate?
After reading your reply, I checked the running services, and it looks like one service is missing. After running `sh reason/llm_service/create_service_math_shepherd.sh`, shouldn't both of these two services be running?
Then I killed that service and tried to rerun `sh reason/llm_service/create_service_math_shepherd.sh`, but after rerunning it, no service started, and the one I had killed was gone as well.
Rerunning the script `sh scripts/eval/cot_greedy.sh` gave a new error: requests.exceptions.ConnectionError: HTTPConnectionPool(host='0.0.0.0', port=28777): Max retries exceeded with url: /get_worker_address (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcf0eb172b0>: Failed to establish a new connection: [Errno 111] Connection refused'))
This error is different from the original one; I think it's because neither service is running now, whereas the first error was presumably caused by one missing service. So the question has become: why does neither service start when I rerun `sh reason/llm_service/create_service_math_shepherd.sh`?
Hi Brainth,
You are right about the missing-service error. What create_service_math_shepherd.sh does is set up two LLM services, for generation and value inference respectively. The first service runs reason.llm_service.workers.vllm_worker with $MODEL_PATH, and the second runs reason.llm_service.workers.reward_model_worker with $VALUE_MODEL_PATH. By default, we run these two services on separate hardware ($NUM_LM_WORKER=1, $NUM_RM_WORKER=1).
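For reference, here is a minimal sketch of what such a launch script could look like, pieced together from the worker commands visible in the `ps` output below; the tmux session layout, variable names, and the controller launch are assumptions, not the actual script:

```bash
# Hypothetical sketch of create_service_math_shepherd.sh: ports, module names and
# the FastChat session name come from this thread; everything else is assumed.
MODEL_PATH=/path/to/mistral-7b-sft                       # generation (LM) model
VALUE_MODEL_PATH=/path/to/math-shepherd-mistral-7b-prm   # value (RM) model
CONTROLLER_ADDR=http://0.0.0.0:28777

tmux new-session -d -s FastChat
# FastChat controller that both workers register with; "Connection refused" on
# port 28777 in the error above means this process is not running:
tmux send-keys -t FastChat "python3 -m fastchat.serve.controller --host 0.0.0.0 --port 28777" Enter

# LM worker served by vLLM:
tmux new-window -t FastChat
tmux send-keys -t FastChat "python3 -m reason.llm_service.workers.vllm_worker --model-path $MODEL_PATH --controller-address $CONTROLLER_ADDR --host 0.0.0.0 --port 30010 --worker-address http://0.0.0.0:30010 --dtype bfloat16 --swap-space 32" Enter

# RM worker:
tmux new-window -t FastChat
tmux send-keys -t FastChat "python3 -m reason.llm_service.workers.reward_model_worker --model-path $VALUE_MODEL_PATH --controller-address $CONTROLLER_ADDR --host 0.0.0.0 --port 30011 --worker-address http://0.0.0.0:30011" Enter
```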
Running `ps -ef | grep open`, you are likely to see two services running:
```
anaconda3/envs/openr/bin/python3 -m reason.llm_service.workers.reward_model_worker --model-path /mnt/nasdata/xxx/llms/huggingface/math-shepherd-mistral-7b-prm --controller-address http://0.0.0.0:28777 --host 0.0.0.0 --port 30011 --worker-address http://0.0.0.0:30011
anaconda3/envs/openr/bin/python3 -m reason.llm_service.workers.vllm_worker --model-path /mnt/nasdata/xxx/llms/huggingface/mistral-7b-sft --controller-address http://0.0.0.0:28777 --host 0.0.0.0 --port 30010 --worker-address http://0.0.0.0:30010 --dtype bfloat16 --swap-space 32
```
For your case, you need to kill these two services and the FastChat tmux session to be able to rerun the services.
In the latest commit I added a checker here: if the model_name is not registered in the controller, it will raise an exception. You can pull it.
As for how to kill the service process, just use `tmux kill-session -t {Your Session Name}`. I'll add this command to the README later.
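Putting the tmux commands from this thread together, a typical inspect-and-restart cycle looks like this (session name FastChat as above):

```bash
tmux ls                          # list sessions; find the one hosting the services
tmux a -t FastChat               # attach to read worker errors; detach with Ctrl-b d
tmux kill-session -t FastChat    # kill it so the launch script can be rerun cleanly
ps -ef | grep open               # confirm no stale worker processes are left behind
sh reason/llm_service/create_service_math_shepherd.sh   # relaunch both services
```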
I set $NUM_LM_WORKER=1 and $NUM_RM_WORKER=1 and ran create_service_math_shepherd.sh on my machine. The reward_model_worker starts fine, but the vllm_worker does not start. Is there a way to track down the cause?
Hello!
Can you provide the number and type of GPUs, as well as the error message from vLLM?
But I don't see any error message from vLLM. GPU 0 is currently running the reward_model_worker.
I ran `tmux attach-session -t FastChat1` and suddenly saw the error message: No module named 'vllm'. It is running successfully now.
Thanks!
> No module named 'vllm'

I think you should first make sure you have successfully installed vllm.
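A quick sanity check, run in the same environment the workers use (the openr conda env visible in the `ps` output above):

```bash
# Verify vllm is importable and print the installed version:
python3 -c "import vllm; print(vllm.__version__)"
# If the import fails with "No module named 'vllm'", install it into this env
# (pin the version your repo's requirements expect; the bare install is a sketch):
pip install vllm
```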
Once I ran create_service_math_shepherd.sh, it seems both services are running? Am I correct? Below is a screenshot of `ps -ef | grep open`:
But very shortly afterwards, the LLM service seems to close itself, and I got the same error: "requests.exceptions.MissingSchema: Invalid URL '/worker_generate': No scheme supplied. Perhaps you meant https:///worker_generate?"
Then I checked my running services again and found that only the RM service seems to be running. I am not sure why the other service closed itself:
I ran into the same problem.
Hi, maybe you can check the output logs, or just attach to the tmux session with `tmux a -t FastChat`.
Solution:
1. Learn some tmux basics for debugging. The LLM service exits suddenly because it hit an error; attach to the tmux session to see the error message.
2. Two changes I made on my side (see the sketch below):
   a. Remove the proxy: `tmux send-keys "unset http_proxy" Enter` and `tmux send-keys "unset https_proxy" Enter`
   b. Based on the error message, comment out `use_beam_search=use_beam_search` at vllm_worker.py:104
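As a concrete sketch of those two changes (the sed edit assumes line 104 of vllm_worker.py still holds the use_beam_search argument in your checkout; newer vLLM releases dropped use_beam_search from SamplingParams, which is a likely reason the worker crashes on it):

```bash
# Inside the tmux session (or in the launch script), drop proxies that would
# otherwise intercept requests to the local controller and workers:
unset http_proxy https_proxy
# Comment out the offending argument; -i.bak keeps a backup of the original file:
sed -i.bak '104s/^/# /' reason/llm_service/workers/vllm_worker.py
```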
Thanks, that helps a lot!
I later found in the log that the error occurs when the two services are launched, but the log does not state the specific cause:
Do you mean the ERROR here? It's not an actual error, just output from log.error(). Can you please check the real error message by attaching to the tmux session?
Following the method above, I successfully ran create_service_qwen2.5_math_hf.sh, and the tmux logs show no errors, but running `python reason/evaluation/evaluate.py` still fails with: Invalid URL '/worker_generate': No scheme supplied. Perhaps you meant https:///worker_generate? I traced the error to text_generation.py: the printed worker_addr is empty, so the URL is missing its http:// prefix, which triggers the error. How can I fix this?
Hi @xiaoweiweixiao, have your problems been solved now? What does your create_llm_service script look like? What are your --LM and --RM during evaluation? Or can you provide more information by attaching to the tmux session and checking the error message?
Thanks, it's solved now. The --LM and --RM arguments must match the model name in the log file; you only need to give the model name, not the full path.
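For anyone hitting the same error: the controller returns an empty worker address when the requested model name is not registered, which is exactly what turns the request URL into a bare '/worker_generate'. Since these are FastChat-style services, you can check the registered names directly (controller address as in this thread; the model name below is just an example, use your own):

```bash
# List the model names the workers registered with the controller:
curl -X POST http://0.0.0.0:28777/list_models
# Look up the worker address for the exact name you pass as --LM / --RM;
# an empty "address" field in the reply reproduces the MissingSchema error:
curl -X POST http://0.0.0.0:28777/get_worker_address \
     -H "Content-Type: application/json" \
     -d '{"model": "mistral-7b-sft"}'
```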
System Info
linux
Who can help?
No response
Reproduction
When I run `sh scripts/eval/cot_greedy.sh`, I get the following error:

```
  File "/home/whg/openr-main/reason/evaluation/evaluate.py", line 195, in <module>
    parallel_evaluate_test_dataset(config.method, solver_fn, save_dir)
  File "/home/whg/openr-main/reason/evaluation/evaluate.py", line 129, in parallel_evaluate_test_dataset
    for i, (problem_inst, result, output) in enumerate(
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/ray/util/actor_pool.py", line 170, in get_generator
    yield self.get_next_unordered()
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/ray/util/actor_pool.py", line 370, in get_next_unordered
    return ray.get(future)
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/ray/_private/worker.py", line 2691, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/ray/_private/worker.py", line 871, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(MissingSchema): ray::RemoteMathEvaluator.evaluate_problem() (pid=29384, ip=10.44.115.98, actor_id=7cdae62f4df49eaf1376b46601000000, repr=<reason.evaluation.evaluator.RemoteMathEvaluator object at 0x7eef295bfe20>)
  File "/home/whg/openr-main/reason/evaluation/evaluator.py", line 116, in evaluate_problem
    solution: SolutionOutput = solver_fn(problem_inst, self.lm_call, self.rm_call)
  File "/home/whg/openr-main/reason/evaluation/methods.py", line 33, in cot
    return best_of_n(config, gen_config, problem_inst, llm_call, rm_call)
  File "/home/whg/openr-main/reason/evaluation/methods.py", line 54, in best_of_n
    output = lm_call(prompt, gen_config)
  File "/home/whg/openr-main/reason/inference/lm_call.py", line 28, in __call__
    return _generate_fastchat(
  File "/home/whg/openr-main/reason/inference/text_generation.py", line 53, in _generate_fastchat
    response = requests.post(
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/requests/api.py", line 115, in post
    return request("post", url, data=data, json=json, **kwargs)
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/requests/sessions.py", line 575, in request
    prep = self.prepare_request(req)
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/requests/sessions.py", line 484, in prepare_request
    p.prepare(
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/requests/models.py", line 367, in prepare
    self.prepare_url(url, params)
  File "/root/anaconda3/envs/open_reasonser/lib/python3.10/site-packages/requests/models.py", line 438, in prepare_url
    raise MissingSchema(
requests.exceptions.MissingSchema: Invalid URL '/worker_generate': No scheme supplied. Perhaps you meant https:///worker_generate?
```
Expected behavior
The script is expected to run normally.