tlc4418 / llm_optimization

A repo for RLHF training and BoN over LLMs, with support for reward model ensembles.
https://arxiv.org/abs/2310.02743
MIT License
25 stars 1 forks

alpacafarm reward-model-human as gold reward #2

Open georgao35 opened 2 months ago

georgao35 commented 2 months ago

Hello, I have found your work and code extremely helpful. Thank you! However, while using it I ran into a few confusing points and am not sure I am doing things correctly, so I hope you can generously help me.

In particular, I am not sure about some points regarding using Alpacafarm/reward-model-human as the gold reward model, as in the paper:

  1. After PPO training, when using Alpacafarm/reward-model-human to assign the gold reward for generated responses, I found I have to set is_alpaca_rm: true instead of false, which is what is originally set in configs/config_rl.json.
  2. In that case, when using alpaca RMs, I believe there is a typo in the function _parse_entry of src/data_utils/rm_dataset_formatter.py. When using this function, the prompt only contains the first line ("Below is an instruction that describes a task, paired with an "), instead of the whole prompt used in AlpacaFarm. I fixed it by adding a \ at the end of each line, but I am not sure if that is the right way.
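The continuation issue described in point 2 can be sketched as follows (a minimal standalone illustration, not the repo's actual code): without a trailing backslash or enclosing parentheses, each line is a separate statement, so only the first string literal is ever assigned.

```python
# Bug sketch: without a line continuation, the assignment ends at the
# first line, and the following string literals are dead expressions.
broken = "Below is an instruction that describes a task, paired with an "
# "input that provides further context."  <- on its own line, this does nothing

# Fix: a trailing backslash joins the lines, and Python then concatenates
# the adjacent string literals into a single string.
fixed = "Below is an instruction that describes a task, paired with an " \
        "input that provides further context."

print(fixed)
```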

I would deeply appreciate it if you could help me with these problems. Thank you!

RylanSchaeffer commented 1 month ago

@georgao35 did you ever get clarity on these two questions?

JohannesAck commented 1 month ago

Also make sure to add the f prefix for string formatting. For me it looks like the following, but it would be great to hear from Thomas:

def _parse_entry(entry: dict, output_alpaca: bool = False):
    instruction = entry["instruction"]
    input = entry.get("input", "")
    answers = entry["answers"]
    if output_alpaca:
        input_ = f"### Input:\n{input}\n\n" if len(input) > 0 else ""
        start_prompt = "Below is an instruction that describes a task, paired with an " \
        "input that provides further context. Write a response that appropriately " \
        f"completes the request.\n\n### Instruction:\n{instruction}\n\n{input_}### " \
        "Response:\n"
        end_token = ""
    else:
        input_ = f"\n{input}" if len(input) > 0 else ""
        start_prompt = f"<|prompter|>{instruction}{input_}<|endoftext|><|assistant|>"
        end_token = "<|endoftext|>"
    return start_prompt, answers, end_token

georgao35 commented 1 month ago

> @georgao35 did you ever get clarity on these two questions?

not yet @RylanSchaeffer

tlc4418 commented 1 month ago

Apologies, I don't monitor this repo very often; I will try to check it more frequently moving forward. I was doing a lot of refactoring towards the end, so there do seem to be a couple of typos. You are correct.

  1. This is correct, and I should change it in the config: when using the alpaca_farm reward model, you will want to set that flag to true.
  2. You want this all to be a single string, so you will want to either add a \ at the end of each line or wrap the whole string in parentheses (). I used to have it all on one line and didn't double-check after reformatting, my bad. And indeed, you also need to add the "f" prefix for the f-string, as @JohannesAck suggested.
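For reference, the parentheses-wrapped alternative mentioned in point 2 could look like the sketch below. The instruction value is a made-up example, not from the repo; adjacent string literals inside the parentheses are concatenated into one prompt, and only the line containing placeholders needs the f prefix.

```python
# Hypothetical example values (not from the repo's datasets):
instruction = "Name three primary colors."
input_ = ""  # empty when the entry has no extra input

# Parentheses let the string span multiple lines without backslashes;
# Python concatenates the adjacent literals into a single string.
start_prompt = (
    "Below is an instruction that describes a task, paired with an "
    "input that provides further context. Write a response that appropriately "
    f"completes the request.\n\n### Instruction:\n{instruction}\n\n{input_}### "
    "Response:\n"
)

print(start_prompt)
```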

I will try to push a fix for these changes soon. In the meantime I hope my answer helps clarify things.