Thanks for sharing your excellent research.
I'm training a fully fine-tuned reward model (without QLoRA) from "LLaVA-RLHF-13b-v1.5-336/sft_model" on LLaVA-Human-Preference-10K, and I find that the eval accuracy is around 63%–67%. This seems lower than expected, since on NLP preference datasets reward accuracy is typically around 75%.
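For reference, by "eval accuracy" I mean the standard pairwise preference accuracy: the fraction of pairs where the reward model scores the human-chosen response above the rejected one. A minimal sketch of how I compute it (the tensor names here are just placeholders for the scalar rewards the model assigns to each pair):

```python
import torch

def pairwise_accuracy(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> float:
    # Fraction of preference pairs where the reward model ranks the
    # human-chosen response above the rejected one.
    return (chosen_rewards > rejected_rewards).float().mean().item()
```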
Is this performance sufficient for the RLHF pipeline, or do you have any intuition on how to improve it?