DripNowhy closed this issue 2 months ago
I realized that I hadn't added the reward_prompt after the input. When I used a standard reward_prompt, I found that approximately 65% of the chosen responses received a higher reward score than the rejected responses. Is this result acceptable? This is how I add the reward_prompt:
import torch

def preprocess_reward_model(
    source,
    tokenizer,
):
    # conversation_lib, tokenizer_image_token, and IMAGE_TOKEN_INDEX are assumed to come
    # from the repo's LLaVA-style utilities.
    # Build a single-turn conversation: the user question followed by the model's answer.
    conv = conversation_lib.default_conversation.copy()
    conv.append_message(conv.roles[0], source["question"])
    conv.append_message(conv.roles[1], source["answer"])

    # Append the reward_prompt after the answer and close with the EOS token.
    conversations = []
    conversations.append(
        conv.get_prompt() + source["reward_prompt"] + "</s>"
    )

    # Tokenize (expanding the image token) and stack into a batch.
    input_ids = torch.stack(
        [
            tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
            for prompt in conversations
        ],
        dim=0,
    ).cuda()

    # Drop the trailing token (the appended "</s>").
    return input_ids[:, :-1]
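For reference, the 65% number is just the fraction of pairs where the chosen answer gets the higher score. A minimal sketch of that check, assuming a hypothetical reward_model callable that returns a scalar score, pairs stored as dicts with question/chosen/rejected/reward_prompt keys, and image inputs omitted for brevity:

# Pairwise accuracy sketch: count how often the chosen answer outscores the rejected one.
def pairwise_accuracy(pairs, tokenizer, reward_model):
    wins = 0
    for pair in pairs:
        chosen = dict(pair, answer=pair["chosen"])      # copy of the sample with the chosen answer
        rejected = dict(pair, answer=pair["rejected"])  # copy of the sample with the rejected answer
        chosen_score = reward_model(preprocess_reward_model(chosen, tokenizer)).item()
        rejected_score = reward_model(preprocess_reward_model(rejected, tokenizer)).item()
        wins += int(chosen_score > rejected_score)
    return wins / len(pairs)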
Yeah 65% RM accuracy is pretty decent 👍
Hey, may I know the reward prompt you used here? I'm facing the same question.
In

conversations.append(
    conv.get_prompt() + source['reward_prompt'] + "</s>"
)

replace source['reward_prompt'] with the prompt below; it seems to work for me:
USER: Please evaluate the quality of your last response. There are several dimensions you should consider in your evaluation:
1. Accurate: The AI should provide factual and accurate information from the image, and refrain from making statements that are not supported by the image or inconsistent with the image.
2. Helpful: The AI’s response should precisely serve the user's needs and interests, while grounding the response in the image.
3. Language Natural: The AI should employ language that flows smoothly and is free from repetitive or awkward constructs.
4. Concise: The AI should efficiently address the task or answer the question, communicating the necessary information with brevity and clarity.
A good response should be accurate, helpful, language natural, and concise. ASSISTANT: Following your definitions, the quality score of my last response is
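Concretely, one way this could be wired in (a sketch only: REWARD_PROMPT and the placeholder question / answer / tokenizer variables are mine, not something defined by the repo):

# Sketch: keep the evaluation prompt in a constant and pass it with each sample.
REWARD_PROMPT = "USER: Please evaluate the quality of your last response. ..."  # full text as quoted above

source = {
    "question": question,            # user question (including the image token, if any)
    "answer": answer,                # response whose quality is being scored
    "reward_prompt": REWARD_PROMPT,
}
input_ids = preprocess_reward_model(source, tokenizer)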
Hi, when I directly apply your reward model to the published preference dataset, the scores seem strange and are always below 0. I recorded the scores of the chosen and rejected responses, and found that the chosen score is not higher than the rejected score for most of the data. The way I use the reward model is:
Thank you!