Dada-Cloudzxy opened this issue 2 weeks ago
Hi @Dada-Cloudzxy, thank you for reporting this issue.
> run scripts `create_service_qwen2.5_math_hf.sh` to start service for eval (NUM_LM_WORKER=2, NUM_RM_WORKER=2)
The hf model runner was adapted from Qwen-Math and is still an inefficient version, so a quick workaround is to use the vLLM API instead. Could you try vLLM by running `create_service_qwen2.5_math_vllm.sh`?
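For reference, the invocation should mirror the hf script; the exact path below is an assumption based on the `reason/llm_service/` commands later in this thread, so adjust it to your checkout:

```bash
# Assumed location, mirroring the create_service_*_hf.sh scripts
# referenced elsewhere in this thread; adjust to your checkout.
bash reason/llm_service/create_service_qwen2.5_math_vllm.sh
```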
In the meantime, we will try to fix this bug.
Thank you very much, I will give it a try.
Hi, I failed to reproduce your error following the instructions. Could you provide more information about the error, such as the error message in each worker session? Many thanks.
@YanSong97 Hi, I ran into the same bug; the only difference is the model. Concretely, I fine-tuned Meta-Llama-3-8B with my `prm/code/finetune_llama.py` to obtain a PRM (`llama3_prm_checkpoint-6358`), and used Meta-Llama-3-8B-Instruct for reasoning eval on MATH. The modified code to reproduce the bug is in my forked repo: Repo Link. Meta-Llama-3-8B and Meta-Llama-3-8B-Instruct were downloaded from Hugging Face; the PRM checkpoint (`llama3_prm_checkpoint-6358`) and the steps to reproduce are in the release of my forked repo: ckpt link.
Here is some extra detail that may be useful:
```
2024-11-12 02:09:42 | ERROR | stderr |   File "/xxx/openr/reason/llm_service/workers/inference.py", line 202, in generate_stream
2024-11-12 02:09:42 | ERROR | stderr |     torch.log_softmax(logits[0, -1, :], dim=-1)[token].tolist()
2024-11-12 02:09:42 | ERROR | stderr | IndexError: index 4130146043630828155 is out of bounds for dimension 0 with size 128256
```
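The gigantic index is consistent with sampling from non-finite logits: once `last_token_logits` is all NaN (see the logs further down in this thread), softmax produces NaN probabilities, and sampling from those on CUDA can silently yield garbage token ids that only blow up later, at the `log_softmax(...)[token]` indexing. A minimal fail-fast guard, sketched against a generic sampling path rather than the repo's actual `generate_stream` code:

```python
import torch

def sample_next_token(last_token_logits: torch.Tensor) -> int:
    # Fail fast on the failure mode seen in the logs: all-NaN fp16 logits
    # that would otherwise turn into garbage token ids downstream.
    if not torch.isfinite(last_token_logits).all():
        raise RuntimeError("non-finite logits; refusing to sample")
    probs = torch.softmax(last_token_logits.float(), dim=-1)
    token = int(torch.multinomial(probs, num_samples=1).item())
    # This is the bound that the reported IndexError implicitly violates.
    assert 0 <= token < last_token_logits.shape[-1]
    return token
```

Failing at the first non-finite logit would make the worker report the real cause instead of an opaque out-of-bounds index.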
python==3.10.15, cuda==11.6, torch==2.4.0, latest version of the code, GPU: 4 × A6000 48G
Run `bash reason/llm_service/create_service_llama3_8b_instruct_hf.sh` in my forked repo to start the service for eval (NUM_LM_WORKER=2, NUM_RM_WORKER=2; the bug still occurs with both workers set to 1), then run `bash scripts/eval/beam_search_MATH_llama3_8b_instruct.sh` in my forked repo to evaluate on the MATH dataset.
System Info
python==3.10.15, cuda==11.8-8.8.1, torch==2.4.0, latest version of the code, GPU: 8 × A100 40G
Who can help?
@ziyuwan @Gebro13 @mengfn @gzqaq @YanSong97 @i
Reproduction
Run `cot_greedy.sh` to evaluate on the MATH dataset.
With NUM_LM_WORKER=1 and NUM_RM_WORKER=1 it runs successfully:

```
91%|███████████████████████████████▊ | 454/500 [1:33:18<06:55, 9.04s/it]
```

But with NUM_LM_WORKER=2 and NUM_RM_WORKER=2 it fails:

```
0%|▏ | 1/500 [00:10<1:28:42, 10.67s/it]
Traceback (most recent call last):
  ...
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```
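The client-side `JSONDecodeError` is likely just a symptom: a worker that crashes mid-request returns an empty or non-JSON body, and `json.loads` then fails at character 0. A hypothetical client-side guard that surfaces the underlying failure (the helper name, URL, and payload are placeholders, not the repo's API):

```python
import requests

def post_worker(url: str, payload: dict) -> dict:
    # Placeholder helper: url/payload are illustrative, not the repo's API.
    resp = requests.post(url, json=payload, timeout=600)
    resp.raise_for_status()
    try:
        return resp.json()  # raises a ValueError subclass on non-JSON bodies
    except ValueError:
        raise RuntimeError(
            f"worker returned non-JSON body: {resp.text[:200]!r}"
        )
```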
To locate the error, I printed some debug messages near `llm_service/workers/inference.py` line 190.
In all four experiments, with different prompts, I got corrupted values (the out-of-index error):
```
2024-11-09 12:21:48 | INFO | stdout | prompt:<|im_start|>system
2024-11-09 12:21:48 | INFO | stdout | Please reason step by step, and put your final answer within \boxed{{}}.<|im_end|>
2024-11-09 12:21:48 | INFO | stdout | <|im_start|>user
2024-11-09 12:21:48 | INFO | stdout | If $f(x) = \frac{3x-2}{x-2}$, what is the value of $f(-2) +f(-1)+f(0)$? Express your answer as a common fraction.<|im_end|>
2024-11-09 12:21:48 | INFO | stdout | <|im_start|>assistant
2024-11-09 12:21:48 | INFO | stdout |
2024-11-09 12:21:48 | INFO | stdout | tmp:tensor([nan, nan], device='cuda:0', dtype=torch.float16)
2024-11-09 12:21:48 | INFO | stdout | last_token_logits:tensor([nan, nan, nan, ..., nan, nan, nan], device='cuda:0',
2024-11-09 12:21:48 | INFO | stdout | dtype=torch.float16) torch.Size([151936])
2024-11-09 12:21:48 | INFO | stdout | indices:tensor([9223231297218904063, 9223231297218904063], device='cuda:0')
2024-11-09 12:21:48 | INFO | stdout | tokens:[9223231297218904063, 9223231297218904063]
```

```
2024-11-09 12:23:17 | INFO | stdout | prompt:<|im_start|>system
2024-11-09 12:23:17 | INFO | stdout | Please reason step by step, and put your final answer within \boxed{{}}.<|im_end|>
2024-11-09 12:23:17 | INFO | stdout | <|im_start|>user
2024-11-09 12:23:17 | INFO | stdout | If $f(x) = \frac{3x-2}{x-2}$, what is the value of $f(-2) +f(-1)+f(0)$? Express your answer as a common fraction.<|im_end|>
2024-11-09 12:23:17 | INFO | stdout | <|im_start|>assistant
2024-11-09 12:23:17 | INFO | stdout |
2024-11-09 12:23:17 | INFO | stdout | tmp:tensor([ 0.3206, -0.0068], device='cuda:0', dtype=torch.float16)
2024-11-09 12:23:17 | INFO | stdout | last_token_logits:tensor([nan, nan, nan, ..., nan, nan, nan], device='cuda:0',
2024-11-09 12:23:17 | INFO | stdout | dtype=torch.float16) torch.Size([151936])
2024-11-09 12:23:17 | INFO | stdout | indices:tensor([571746046575616, 580542139599872], device='cuda:0')
2024-11-09 12:23:17 | INFO | stdout | tokens:[571746046575616, 580542139599872]
```

```
2024-11-09 12:25:45 | INFO | stdout | prompt:<|im_start|>system
2024-11-09 12:25:45 | INFO | stdout | Please reason step by step, and put your final answer within \boxed{{}}.<|im_end|>
2024-11-09 12:25:45 | INFO | stdout | <|im_start|>user
2024-11-09 12:25:45 | INFO | stdout | What is the smallest positive perfect cube that can be written as the sum of three consecutive integers?<|im_end|>
2024-11-09 12:25:45 | INFO | stdout | <|im_start|>assistant
2024-11-09 12:25:45 | INFO | stdout |
2024-11-09 12:25:45 | INFO | stdout | tmp:tensor([nan, nan], device='cuda:0', dtype=torch.float16)
2024-11-09 12:25:45 | INFO | stdout | last_token_logits:tensor([nan, nan, nan, ..., nan, nan, nan], device='cuda:0',
2024-11-09 12:25:45 | INFO | stdout | dtype=torch.float16) torch.Size([151936])
2024-11-09 12:25:45 | INFO | stdout | indices:tensor([9223231297218904063, 9223231297218904063], device='cuda:0')
2024-11-09 12:25:45 | INFO | stdout | tokens:[9223231297218904063, 9223231297218904063]
```

```
2024-11-09 12:27:39 | INFO | stdout | prompt:<|im_start|>system
2024-11-09 12:27:39 | INFO | stdout | Please reason step by step, and put your final answer within \boxed{{}}.<|im_end|>
2024-11-09 12:27:39 | INFO | stdout | <|im_start|>user
2024-11-09 12:27:39 | INFO | stdout | What is the smallest positive perfect cube that can be written as the sum of three consecutive integers?<|im_end|>
2024-11-09 12:27:39 | INFO | stdout | <|im_start|>assistant
2024-11-09 12:27:39 | INFO | stdout |
2024-11-09 12:27:39 | INFO | stdout | tmp:tensor([nan, nan], device='cuda:0', dtype=torch.float16)
2024-11-09 12:27:39 | INFO | stdout | last_token_logits:tensor([nan, nan, nan, ..., nan, nan, nan], device='cuda:0',
2024-11-09 12:27:39 | INFO | stdout | dtype=torch.float16) torch.Size([151936])
2024-11-09 12:27:39 | INFO | stdout | indices:tensor([9223231297218904063, 9223231297218904063], device='cuda:0')
2024-11-09 12:27:39 | INFO | stdout | tokens:[9223231297218904063, 9223231297218904063]
```
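Since every failing sample shows `last_token_logits` as all-NaN fp16 before sampling, one way to localize the problem is to trap the first module whose output turns non-finite. This is a generic PyTorch debugging sketch, not project code; `register_nan_hooks` is a name invented here:

```python
import torch

def register_nan_hooks(model: torch.nn.Module) -> None:
    # Raise at the first module whose output contains NaN/Inf, to narrow
    # down where the fp16 forward pass starts emitting non-finite values.
    def make_hook(name: str):
        def hook(module, args, output):
            out = output[0] if isinstance(output, tuple) else output
            if torch.is_tensor(out) and not torch.isfinite(out).all():
                raise RuntimeError(f"first non-finite output at module: {name}")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))
```

Calling `register_nan_hooks(model)` in the worker before serving would show whether the NaNs originate in a particular layer only when two workers share GPUs.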
Expected behavior
The expectation is that the scripts run properly with multiple workers.