llava-rlhf / LLaVA-RLHF

Aligning LMMs with Factually Augmented RLHF
https://llava-rlhf.github.io/
GNU General Public License v3.0

Question regarding training the reward model #16

Closed TianjinTeda closed 1 year ago

TianjinTeda commented 1 year ago

Hi,

Thanks for this excellent work!

I am training a reward model based on the original LLaVA-7B SFT model; however, the loss does not decrease and the eval accuracy does not increase. Do you have any idea what I might have done wrong?
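(For context: reward models in RLHF pipelines like this one are typically trained with a pairwise Bradley-Terry objective over chosen/rejected preference pairs, which is what the loss values in the replies below correspond to. The repository's actual training code may differ in details; the following is only a minimal sketch of that loss and the matching accuracy metric, with illustrative function names that are not taken from the codebase.)

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor,
                         rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: -log sigmoid(r_chosen - r_rejected),
    # averaged over the batch of preference pairs.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def pairwise_accuracy(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Fraction of pairs where the chosen response gets the higher reward.
    return (chosen_rewards > rejected_rewards).float().mean()

# With an untrained reward head, r_chosen ≈ r_rejected, so the loss starts
# near -log(0.5) ≈ 0.693 and the accuracy near 0.5; both should move away
# from those chance-level values within the first few hundred steps.
chosen, rejected = torch.randn(8), torch.randn(8)
print(pairwise_reward_loss(chosen, rejected).item())
print(pairwise_accuracy(chosen, rejected).item())
```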

Edward-Sun commented 1 year ago

Hi, does our original script in 13b-v1.5-336/train_reward_model.sh work on your side?

TianjinTeda commented 1 year ago

I am still working on training the 13b model, as I do not have an A100 at the moment. What results could I expect from the 13b reward model (what should the evaluation accuracy be on the preference data)?

Edward-Sun commented 1 year ago

If you use our SFT-13b as the reward model, the first few steps would look like:

{'loss': 0.692, 'learning_rate': 1.1111111111111113e-05, 'epoch': 0.02}
{'loss': 0.6897, 'learning_rate': 2e-05, 'epoch': 0.04}
{'loss': 0.6916, 'learning_rate': 2e-05, 'epoch': 0.05}
{'loss': 0.6878, 'learning_rate': 2e-05, 'epoch': 0.07}
{'loss': 0.6778, 'learning_rate': 2e-05, 'epoch': 0.09}
{'loss': 0.6756, 'learning_rate': 2e-05, 'epoch': 0.11}
{'loss': 0.64, 'learning_rate': 2e-05, 'epoch': 0.13}
{'loss': 0.6635, 'learning_rate': 2e-05, 'epoch': 0.14}
{'loss': 0.6475, 'learning_rate': 2e-05, 'epoch': 0.16}
{'loss': 0.6117, 'learning_rate': 2e-05, 'epoch': 0.18}
{'eval_loss': 0.6177946925163269, 'eval_accuracy': 0.6600000262260437, 'eval_label_positive_rate': 0.4860000014305115, 'eval_average_score': -0.2672645151615143, 'eval_runtime': 61.9716, 'eval_samples_per_second': 8.068, 'eval_steps_per_second': 0.258, 'epoch': 0.18}

The final eval_acc is also around 65-70%.
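(For reference, this log is consistent with the pairwise objective sketched above: the loss starts near the chance-level value ln 2 ≈ 0.693 and drifts down, and eval_accuracy, presumably the fraction of held-out preference pairs where the chosen response is scored higher, reaches about 0.66 by epoch 0.18, well above the 0.5 baseline. A loss that stays pinned near 0.69 with accuracy around 0.5, as in the original question, would mean the reward head is not separating chosen from rejected responses at all.)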

TianjinTeda commented 1 year ago

Hi @Edward-Sun, thanks for your response! One last question: would 65-70% accuracy be enough for it to work as a reward model?

Edward-Sun commented 1 year ago

Yes, this is enough. According to AlpacaFarm and our internal study, the held-out human agreement rate is typically also around 65-70%.

TianjinTeda commented 1 year ago

Appreciate it!