Hi, the RM is released here:
https://github.com/llava-rlhf/LLaVA-RLHF/tree/main/RLHF#1-training-the-instruction-following-reward-model
Note: For both the 7b and 13b policy models, we use the same 13b reward model. We also provide the pretrained reward model checkpoint at LLaVA-RLHF-13b-v1.5-336/rm_lora_adapter_model. To use the pretrained LoRA checkpoint, the base_model_name_or_path field in adapter_config.json needs to be changed to the actual path of the SFT model.
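A minimal sketch of that edit, assuming the adapter directory layout above and a placeholder SFT path (replace `/path/to/your/sft_model` with your actual checkpoint directory):

```python
import json

# Hypothetical local paths -- adjust to where you downloaded the checkpoints.
adapter_config_path = "LLaVA-RLHF-13b-v1.5-336/rm_lora_adapter_model/adapter_config.json"
sft_model_path = "/path/to/your/sft_model"

# Load the adapter config, repoint it at the local SFT base model, and write it back.
with open(adapter_config_path) as f:
    config = json.load(f)

config["base_model_name_or_path"] = sft_model_path

with open(adapter_config_path, "w") as f:
    json.dump(config, f, indent=2)
```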