RL Training (train_math.py) fails with an error

ChenLong-UCAS commented 1 month ago

System Info

Command:

python -u train_math.py \
    --dataset_path "./math_500.jsonl" \
    --model_name_or_path "./Qwen2.5-Math-1.5B" \
    --prm_model_name_or_path "./Qwen2.5-Math-7B-Instruct" \
    --algorithm_name "APPO" \
    --num_mini_batch 4 \
    --ppo_epoch 1

Error:

Traceback (most recent call last):
  File "/data1/c00841194/pycharm_project/openr-main/train/mat/scripts/train_math.py", line 108, in <module>
    main(sys.argv[1:])
  File "/data1/c00841194/pycharm_project/openr-main/train/mat/scripts/train_math.py", line 100, in main
    runner.run()
  File "/data1/c00841194/pycharm_project/openr-main/train/mat/runner/shared/math_runner.py", line 76, in run
    rewards = self.prm.get_reward(obs, actions)
  File "/root/anaconda3/envs/openr/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/data1/c00841194/pycharm_project/openr-main/train/mat/models/ms_prm.py", line 40, in get_reward
    last_step_score = step_score[-1]
IndexError: index -1 is out of bounds for dimension 0 with size 0

Debug info: ms_prm.py, line 42:

step_score = score[i][input_ids["input_ids"][i] == self.step_tag_id]  # the step_score tensor is empty

Who can help?

No response

Information

[X] The official example scripts
[ ] My own modified scripts

Tasks

[ ] An officially supported task in the codebase (such as scripts/, ...)
[ ] My own task or dataset (give details below)

Reproduction

1. Set up the environment by following the markdown instructions.
2. Download the Qwen 1.5B and 7B models.
3. Run python -u train_math.py

Expected behavior

Training runs without this error.

morning9393 commented 1 month ago

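This is most likely because the input passed to the PRM contains no step tag. Could you print the PRM input and check?

The PRM uses the step tag to decide where to output a reward, and then takes the reward of the last step as the feedback for the current reasoning step.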
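To illustrate the mechanism described above, here is a minimal sketch in plain Python (lists instead of tensors). The function name `last_step_score` and the toy token IDs and scores are assumptions made for illustration; this is not the code in ms_prm.py. It shows why an input without any step-tag token produces an empty selection, which is what triggers the IndexError in the traceback.

```python
from typing import Optional, Sequence


def last_step_score(token_ids: Sequence[int],
                    scores: Sequence[float],
                    step_tag_id: int) -> Optional[float]:
    """Return the score at the last step-tag position, or None if there is no tag.

    Hypothetical helper: mirrors the idea of masking per-token scores by
    `token_ids == step_tag_id` and taking the last selected element.
    """
    step_scores = [s for t, s in zip(token_ids, scores) if t == step_tag_id]
    if not step_scores:
        # No step tag in the input: indexing [-1] here would raise IndexError.
        return None
    return step_scores[-1]


STEP_TAG_ID = 16748  # ID reported in this thread for 'ки'

# Input that contains the step tag twice: the last tagged score is used.
print(last_step_score([10, 11, 16748, 12, 16748],
                      [0.1, 0.2, 0.9, 0.3, 0.7], STEP_TAG_ID))  # 0.7

# Input whose tag was tokenized differently: nothing matches.
print(last_step_score([10, 7665, 1802], [0.1, 0.2, 0.3], STEP_TAG_ID))  # None
```

Returning None (or raising a descriptive error) makes a missing step tag visible at the point where it happens, instead of surfacing later as an out-of-bounds index.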

wusijie123 commented 1 month ago

https://github.com/openreasoner/openr/blob/fd6ff6c90072147af7114747cfb2110913e64ff7/train/mat/models/ms_prm.py#L32

This line adds a space before the step tag, which changes how it is tokenized:

>>> tokenizer.encode('ки')
[16748]
>>> tokenizer.encode(' ки')
[7665, 1802]
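A small diagnostic along these lines can confirm whether the step-tag ID actually appears in an encoded prompt. The sketch below is an assumption-laden illustration: `step_tag_positions` is a hypothetical helper, and `toy_encode` is a stand-in tokenizer that only reproduces the two encodings reported above (every other character gets a placeholder ID). With a real Hugging Face tokenizer, one would presumably pass something like `lambda s: tokenizer.encode(s, add_special_tokens=False)` instead.

```python
from typing import Callable, List

# Toy vocabulary reproducing only the IDs reported in this thread.
TOY_VOCAB = {" ки": [7665, 1802], "ки": [16748]}


def toy_encode(text: str) -> List[int]:
    """Stand-in tokenizer: longest match on the two known pieces,
    placeholder IDs (100000 + code point) for every other character."""
    ids: List[int] = []
    i = 0
    while i < len(text):
        for piece in (" ки", "ки"):  # try the longer piece first
            if text.startswith(piece, i):
                ids.extend(TOY_VOCAB[piece])
                i += len(piece)
                break
        else:
            ids.append(100000 + ord(text[i]))
            i += 1
    return ids


def step_tag_positions(encode: Callable[[str], List[int]],
                       text: str, step_tag: str) -> List[int]:
    """Return the token positions in `text` that equal the step tag's single ID."""
    tag_ids = encode(step_tag)
    if len(tag_ids) != 1:
        raise ValueError(f"step tag {step_tag!r} is not a single token: {tag_ids}")
    tag_id = tag_ids[0]
    return [pos for pos, tok in enumerate(encode(text)) if tok == tag_id]


print(step_tag_positions(toy_encode, "x=1ки", "ки"))   # [3]
print(step_tag_positions(toy_encode, "x=1 ки", "ки"))  # []
```

An empty result for a prompt that visibly ends with the tag is the symptom described here: the space was merged into the tag's tokens, so the single-token ID is never found.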

ChenLong-UCAS commented 1 month ago

How can this bug be fixed? I haven't yet found where a space is added before self.step_tag.

ChenLong-UCAS commented 1 month ago

inputs_for_prm.append(f"{o}{a}{self.step_tag}")  # remove the space between {a} and {self.step_tag}
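The one-line change above can be sketched as a standalone function to make the before/after difference explicit. The names `build_prm_inputs` and `has_space_before_tag` are hypothetical and exist only for this illustration; the sample observation and action strings are made up. Note that if an action itself ends with a space, the tag would still be preceded by one, so the check helper may be worth running on real inputs.

```python
from typing import List, Sequence


def build_prm_inputs(obs: Sequence[str], actions: Sequence[str],
                     step_tag: str = "ки") -> List[str]:
    """Concatenate observation, action, and step tag with no separator,
    so the tag is not preceded by an extra space."""
    return [f"{o}{a}{step_tag}" for o, a in zip(obs, actions)]


def has_space_before_tag(text: str, step_tag: str = "ки") -> bool:
    """True if any occurrence of the step tag is directly preceded by a space."""
    return f" {step_tag}" in text


fixed = build_prm_inputs(["Q: 1+1=? "], ["Step 1: the answer is 2."])
print(fixed[0])                        # Q: 1+1=? Step 1: the answer is 2.ки
print(has_space_before_tag(fixed[0]))  # False

# The variant with a space between the action and the tag, for comparison.
buggy = "Q: 1+1=? " + "Step 1: the answer is 2." + " " + "ки"
print(has_space_before_tag(buggy))     # True
```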

Symbolk commented 3 weeks ago

Huh, why was this fixed without a PR or commit being submitted? Cloning the repo and finding that it does not run is still frustrating~
