May I ask a naive question... how do you generate this?
I used a wandb table. You can configure wandb by following the steps outlined in this issue: https://github.com/microsoft/DeepSpeedExamples/issues/308. Then add the following code snippet to your rw_eval.py file.
```python
import wandb  # add near the top of rw_eval.py; assumes wandb.init(...) has already been called

example_table = wandb.Table(
    columns=["Prompt", "Good Response", "Bad Response", "Good Score", "Bad Score"])

# prompt_list, good_ans_list, bad_ans_list, tokenizer, rm_model and device
# are already defined in rw_eval.py
for prompt, good_ans, bad_ans in zip(prompt_list, good_ans_list, bad_ans_list):
    batch = prepare_datapair(prompt,
                             good_ans,
                             bad_ans,
                             tokenizer,
                             max_seq_len=512,
                             end_of_conversation_token="<|endoftext|>")
    batch = to_device(batch, device)

    # Run inference
    with torch.no_grad():
        outputs = rm_model(**batch)

    good_score = outputs["chosen_mean_scores"].item()
    bad_score = outputs["rejected_mean_scores"].item()

    print("==================Eval result============================")
    print("prompt: ", prompt)
    print("\ngood_ans: ", good_ans)
    print("\nbad_ans: ", bad_ans)
    print()
    print("=============Scores (higher, better)========================")
    print("good_ans score: ", good_score)
    print("bad_ans score: ", bad_score)

    example_table.add_data(prompt, good_ans, bad_ans, str(good_score), str(bad_score))

# Log the full table once after the loop
wandb.log({"examples": example_table})
```
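For context, here is a minimal sketch of where this sits in rw_eval.py. The project and run names below are placeholders, and the actual wandb configuration steps are in the issue linked above.

```python
import wandb

# Placeholder project/run names -- adjust to your own wandb setup
# (see the linked issue for the configuration steps).
wandb.init(project="deepspeed-chat-step2-eval", name="opt-1.3b-reward-eval")

# ... run the evaluation loop above, which fills and logs example_table ...

wandb.finish()  # flush the logged table to the wandb dashboard
```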
@DanqingZ thanks!
I'm hitting the same issue. I trained OPT-1.3b on a few of the datasets listed in the training script. When I then run rw_eval.py, I get results like these: it seems that if a response simply repeats a word, it receives a very high score, much higher than a normal response. This badly skews step 3 RLHF training: no matter what I input, my step 3 trained model outputs the same simple repetition of a word. I wonder if there is a good solution.
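For anyone who wants to reproduce this quickly, here is a rough probe that reuses the helpers from the rw_eval.py snippet above (the prompt and both answers are made-up examples): it scores a normal answer against a degenerate answer that just repeats one word.

```python
import torch

# Hypothetical probe; prepare_datapair, to_device, tokenizer, rm_model and device
# are the objects already defined in rw_eval.py. The prompt and answers are made up.
prompt = "Human: Please tell me about Microsoft in a few sentences. Assistant:"
normal_ans = " Microsoft is a software company that develops and sells computer software."
repeated_ans = " thanks" * 100  # degenerate response that just repeats one word

batch = prepare_datapair(prompt,
                         normal_ans,
                         repeated_ans,
                         tokenizer,
                         max_seq_len=512,
                         end_of_conversation_token="<|endoftext|>")
batch = to_device(batch, device)

with torch.no_grad():
    outputs = rm_model(**batch)

# A well-trained reward model should give the normal answer a clearly higher score;
# in the failure mode described above, the repeated-word answer wins instead.
print("normal answer score  :", outputs["chosen_mean_scores"].item())
print("repeated answer score:", outputs["rejected_mean_scores"].item())
```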
Same error as you.
@DanqingZ,
Thank you very much for your detailed report! This is quite helpful.
"I realized that the crucial aspect is ensuring the reward model assigns high scores to good responses and low scores to bad ones, creating a significant gap. However, our existing metrics do not capture this aspect." I totally agree with this. We have been investigating the step2 reward calculation and have observed a similar phenomenon, e.g., increasing the step2 accuracy sometimes does not guarantee an improved end-to-end generation quality. Meanwhile, please feel free to let us know if you have other metric candidates in mind we can add and test.
Best, Minjia
I've recently discovered that reward modeling plays a crucial role in the third step of PPO training. In my previous reward model, the score would increase with more "thanks" tokens, regardless of the input. Consequently, the PPO trainer exploited this by training the actor model to generate as many "thanks" tokens as possible.
To address this, I integrated wandb into my step 2 training and evaluation code and conducted some hyperparameter tuning. As illustrated in the plot below, the yellow line doesn't perform well, while the blue line exhibits the best performance.
Examining the evaluation results for the yellow line, the average score looks decent, with an accuracy rate approaching 70%. However, it also assigns high scores to some really bad responses.
Upon examining the blue line evaluation outcome, I realized that the crucial aspect is ensuring the reward model assigns high scores to good responses and low scores to bad ones, creating a significant gap. However, our existing metrics do not capture this aspect. I primarily relied on the accuracy metrics to determine whether my reward model had been trained effectively.
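One way to make that gap visible, as a rough sketch (chosen_scores / rejected_scores are hypothetical names for per-pair scores collected during the eval loop above): log the mean chosen-minus-rejected margin and the raw score distributions alongside accuracy, so a model that wins only narrowly, or that scores garbage highly, stands out.

```python
import wandb

# Hypothetical extra step-2 metrics: chosen_scores / rejected_scores are assumed
# to be plain Python lists of per-pair scores collected during the eval loop above.
margins = [c - r for c, r in zip(chosen_scores, rejected_scores)]

wandb.log({
    "eval/accuracy": sum(m > 0 for m in margins) / len(margins),  # pairwise win rate
    "eval/mean_margin": sum(margins) / len(margins),              # average good-vs-bad gap
    "eval/chosen_score_hist": wandb.Histogram(chosen_scores),     # score distributions, to
    "eval/rejected_score_hist": wandb.Histogram(rejected_scores), # check how much they overlap
})
```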