l294265421 / alpaca-rlhf

Finetuning LLaMA with RLHF (Reinforcement Learning with Human Feedback) based on DeepSpeed Chat

stop at step2 evaluation_reward #5

Open murphypei opened 1 year ago

murphypei commented 1 year ago

Firstly, thank you for your contributions. Training consistently hangs (without exiting) at evaluation_reward during step 2, so I am wondering if something is wrong. Perhaps the condition args.global_rank == 0 around that call is unnecessary? Any suggestions would be greatly appreciated. Thank you.
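
For reference, this is the pattern I mean, reconstructed from the discussion below (not a verbatim copy of the repository's step 2 script; the wandb keys are abbreviated):

if (step + 1) % 100 == 0:
    if args.global_rank == 0:
        # Suspect condition: only rank 0 enters evaluation_reward here,
        # while the other ranks skip it entirely.
        reward_score, rejected_scores, acc, score_std = evaluation_reward(rm_model, eval_dataloader)
        wandb.log({'Eval/reward_score': reward_score, 'Eval/acc': acc})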

murphypei commented 1 year ago

The following code works well.

# evaluation_reward must run on ALL ranks: it contains collective
# communication, so gating it behind global_rank == 0 deadlocks rank 0.
if (step + 1) % 100 == 0:
    reward_score, rejected_scores, acc, score_std = evaluation_reward(rm_model, eval_dataloader)
    # Logging, by contrast, should happen on rank 0 only.
    if args.global_rank == 0:
        wandb.log({
            'Eval/epoch': -1,
            'Eval/reward_score': reward_score,
            'Eval/score_std': score_std,
            'Eval/rejected_scores': rejected_scores,
            'Eval/acc': acc,
        })
l294265421 commented 1 year ago

> The following code works well.

You are right. The condition args.global_rank == 0 has to be removed from around the evaluation_reward call: the method performs collective communication across processes, so all ranks must enter it. If only rank 0 does, it blocks forever waiting for the others, which is exactly the pause you observed.
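
For anyone hitting the same hang: collective operations such as all_reduce block until every process in the group has called them. Below is a minimal sketch of the general shape of such an evaluation function; it is hypothetical and simplified, not the repository's actual evaluation_reward, whose metrics and signature differ.

import torch
import torch.distributed as dist

def evaluation_reward_sketch(model, dataloader):
    """Hypothetical, simplified distributed evaluation of a reward model."""
    model.eval()  # note: the caller must restore train mode afterwards
    correct, total = 0, 0
    with torch.no_grad():
        for batch in dataloader:
            # Assumption: the model returns scores for chosen/rejected pairs.
            chosen_scores, rejected_scores = model(**batch)
            correct += (chosen_scores > rejected_scores).sum().item()
            total += chosen_scores.numel()
    device = next(model.parameters()).device
    stats = torch.tensor([correct, total], dtype=torch.float32, device=device)
    # Collective call: EVERY rank must reach this line. If the whole function
    # is wrapped in `if args.global_rank == 0:`, rank 0 blocks here forever
    # waiting for the other ranks, which is the hang reported in this issue.
    dist.all_reduce(stats, op=dist.ReduceOp.SUM)
    return (stats[0] / stats[1]).item()  # accuracy aggregated across all ranks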

l294265421 commented 1 year ago

In addition, there is another bug: the rm_model.train() call should be inside the step loop, so the model is switched back to training mode after each evaluation (evaluation_reward puts it into eval mode).
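
A sketch of the corrected placement (simplified, with names as in the step 2 script and the logging abbreviated): since evaluation_reward switches the model to eval mode, train mode has to be restored inside the step loop after every evaluation, not just once per epoch.

for epoch in range(args.num_train_epochs):
    rm_model.train()
    for step, batch in enumerate(train_dataloader):
        # ... forward pass, loss, backward, optimizer step ...
        if (step + 1) % 100 == 0:
            reward_score, rejected_scores, acc, score_std = evaluation_reward(rm_model, eval_dataloader)
            if args.global_rank == 0:
                wandb.log({'Eval/reward_score': reward_score, 'Eval/acc': acc})
            # evaluation_reward switched the model to eval mode; switch back
            # here, inside the step loop, or every step after the first
            # evaluation trains with dropout disabled.
            rm_model.train()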

murphypei commented 1 year ago

> In addition, there is another bug: the rm_model.train() call should be inside the step loop.

OK, thanks for your reply.