tlc4418 / llm_optimization

A repo for RLHF training and BoN over LLMs, with support for reward model ensembles.
https://arxiv.org/abs/2310.02743
MIT License

How to re-implement the score-KL curve? #1

Closed zetian1025 closed 1 month ago

zetian1025 commented 6 months ago

Am I right:

  1. First sample D test prompts, then for each checkpoint (e.g. saved at a fixed interval) generate one model output for each test prompt;
  2. Perform sentence-level KL with a forward step using both the policy model and the reference model (see the note below the list), and calculate scores for the D outputs (generated by the policy model);
  3. Average the D results to get one point in the scatter diagram (as in Figures 4 and 5)?
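
To be concrete about step 2, I assume the sentence-level KL for a prompt $x$ and policy completion $y$ is the single-sample Monte Carlo estimate obtained from one forward pass of each model over the generated sequence:

$$\widehat{\mathrm{KL}}(x) = \sum_{t} \left[ \log \pi_\theta(y_t \mid x, y_{<t}) - \log \pi_{\mathrm{ref}}(y_t \mid x, y_{<t}) \right]$$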
JohannesAck commented 1 month ago

Hi @zetian1025, did you find a good source for this question? I've been wondering about the same thing.

tlc4418 commented 1 month ago

Apologies for the delay in answering; I will try to check the issues more often in the future.

What @zetian1025 proposed sounds correct. The idea is to select n (e.g. 1000) test prompts, perform a forward step for each of them, calculate the KL divergence between the current policy and the initial (reference) policy as well as the score for each prompt, and average over all n prompts.
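
For illustration, here is a minimal sketch of how one point on the score-KL curve could be computed. It assumes Hugging Face-style causal LMs and a placeholder `reward_fn(prompt, completion)`; the model names, paths, and helper functions are hypothetical and not this repo's actual API:

```python
# Minimal sketch: one (KL, score) point for a single policy checkpoint.
# Assumes Hugging Face transformers; "policy-checkpoint", "reference-model",
# `prompts`, and `reward_fn` are placeholders, not the repo's actual API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("policy-checkpoint")  # hypothetical path
policy = AutoModelForCausalLM.from_pretrained("policy-checkpoint").to(device).eval()
reference = AutoModelForCausalLM.from_pretrained("reference-model").to(device).eval()


def sequence_logprob(model, input_ids, prompt_len):
    """Sum of token log-probs of the completion (tokens after the prompt) under `model`."""
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position t predict token t+1, so shift by one.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Only count tokens belonging to the completion.
    return token_lp[:, prompt_len - 1:].sum(dim=-1)


def kl_and_score(prompt, reward_fn):
    enc = tokenizer(prompt, return_tensors="pt").to(device)
    prompt_len = enc.input_ids.shape[1]
    gen = policy.generate(**enc, max_new_tokens=128, do_sample=True,
                          pad_token_id=tokenizer.eos_token_id)
    # Single-sample estimate of the sequence-level KL: log pi(y|x) - log pi_ref(y|x).
    kl = (sequence_logprob(policy, gen, prompt_len)
          - sequence_logprob(reference, gen, prompt_len)).item()
    completion = tokenizer.decode(gen[0, prompt_len:], skip_special_tokens=True)
    return kl, reward_fn(prompt, completion)


# Average over all n test prompts to get one point on the score-KL curve.
# `prompts` and `reward_fn` are assumed to be provided elsewhere.
# kls, scores = zip(*(kl_and_score(p, reward_fn) for p in prompts))
# point = (sum(kls) / len(kls), sum(scores) / len(scores))
```

Repeating this for each saved checkpoint gives the sequence of points that traces out the curve.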