tatsu-lab / alpaca_eval

An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
https://tatsu-lab.github.io/alpaca_eval/
Apache License 2.0

Details on Training GLM for Length-Controlled Winrate #346

Closed yix8 closed 1 month ago

yix8 commented 2 months ago

Thank you for this fantastic project. However, after reading the paper and reviewing the GitHub repository, I have the following questions: image

  1. In your paper, you proposed the above equation, where γ_x is the instruction difficulty term shared across models, which is pre-computed and stored in the "instruction_difficulty.csv" file. I would like to understand how these parameters are trained. The paper mentions estimating γ_x by fitting a joint regression across all models with the ψ_m - ψ_b term fixed to one. Can this be interpreted as first training the GLM on all models in the leaderboard to estimate it? After that, we could fit θ, ϕ, ψ using only the model we are interested in (a rough sketch of how I read the model appears after this list). Could you please provide the training code or a detailed description of the process used to train these parameters?
  2. I also noticed that AlpacaEval provides an interface to compute the preference model's win rate against different reference models by modifying the `--reference_outputs` flag. However, in the source code, the same instruction difficulty term is used during model fitting even if the reference model changes. I am therefore wondering whether we need to train a new instruction difficulty term when using a different baseline model. My understanding is that the training dataset used to fit the instruction difficulty term includes std_delta_len, which depends on the baseline. Should we therefore create a new dataset based on the new baseline and refit the instruction difficulty term when the baseline changes?
  3. After loading the instruction difficulty term, we need to fit the above logistic regression for the preference model. I noticed that the code then uses the results from gpt4_1106_preview_concise and gpt4_1106_preview_verbose stored in "df_gamed.csv" to regularize the model. Could you please give some intuition on why these gamed baselines are needed as part of the training set? Additionally, if we use such regularization, should we also create new gamed baselines when the baseline/reference model changes, instead of still using gpt4_1106_preview_concise/gpt4_1106_preview_verbose?
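
For concreteness, here is a tiny sketch of how I currently read a single prediction of the model (my own code and variable names, not the repository's):

```python
import numpy as np

def lc_preference_prob(theta_mb: float, phi_mb: float,
                       psi_m: float, psi_b: float, gamma_x: float,
                       len_m: int, len_b: int, std_delta_len: float) -> float:
    """My reading of the GLM: a model-vs-baseline bias (theta), a standardized
    length term (phi * tanh), and the instruction difficulty gamma_x weighted
    by (psi_m - psi_b). Returns P(judge prefers model m over baseline b)."""
    length_term = phi_mb * np.tanh((len_m - len_b) / std_delta_len)
    difficulty_term = (psi_m - psi_b) * gamma_x
    logit = theta_mb + length_term + difficulty_term
    return float(1.0 / (1.0 + np.exp(-logit)))
```

This is also the reason behind question 2: std_delta_len depends on the baseline's outputs, so changing the baseline changes the length feature itself, not just the annotations.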

Thank you for your time and help.

zhouku92 commented 1 month ago

+1 on these questions.

For question #3 above, it seems that "df_gamed.csv" contains the so-called "gamed" baselines. Does it intend to make the GLM more robust against "token length"? In other words, question #3 corresponds to the "additional weak regularization on $\phi_{m,b}$" mentioned in the paper, am I right?

YannDubs commented 1 month ago

Hi @yix8,

Sorry for the delayed response; I've been very busy recently.

Before answering your questions, it's worth noting that I tried fitting instruction_difficulty in many different sensible ways, and it made nearly no difference in the final results. So I see it as a simple but useful feature for the GLM. In particular, none of the properties of LC AlpacaEval depend on how the instruction_difficulty feature is fitted; the properties come from the multiplication by ψ_m - ψ_b. Even using random features would maintain all the desired properties, although in that case the correlation with LMSYS decreases from 98% to approximately 95%.

  1. Here's a notebook for computing instruction_difficulty. Concerning your question "Can this be interpreted as first training the GLM on all models in the leaderboard to estimate it?": the answer is "not quite", because the (ψ_m - ψ_b) factor is set to 1. Jointly training a multiplicative term makes the problem much harder to optimize (it is not convex). So we first fix (ψ_m - ψ_b) and train instruction_difficulty, and then do the opposite, which keeps each problem convex and makes it possible to fit each model independently (a rough sketch of this two-stage procedure follows this list).
  2. There are two ways of computing the score with a different baseline. The first is to use the GLM to predict the preference the LLM judge would have given, as shown in Figure 5 of the paper. You can do that easily using `--metric_kwargs "{'baseline':'Meta-Llama-3-8B-Instruct'}"`. The second is to actually annotate against a different baseline, i.e., using `--reference_outputs`. Ideally, you would indeed either refit or remove the instruction_difficulty term, and either remove or change the regularization toward the GPT-4 baseline. Given what I said above about instruction_difficulty being simply a good feature, I would suggest keeping the same instruction_difficulty but removing the regularization using `--metric_kwargs "{'glm_name':'length_controlled_noreg'}"`. To remove both instruction_difficulty and the regularization, use `--metric_kwargs "{'glm_name':'length_controlled_minimal'}"`.
  3. As mentioned by @zhouku92, this corresponds to the "additional weak regularization on ϕ_{m,b}". In particular, it regularizes ϕ_{m,b} toward zero. As mentioned in the paper, this has nearly no effect on average-case models, but it avoids adversarial submissions where all of a model's non-preferred outputs are trimmed down to a few characters while the preferred ones are kept, which would otherwise make the GLM learn a very large ϕ. If you are not building a leaderboard that people might try to game, I would drop it for simplicity. To see its effect, try replacing `metrics, models = disjoint_optimization_(lb, df, df_lb, formula=formula, regularize_to_baseline_lambda=0.2)` with `metrics, models = disjoint_optimization_(lb, df, df_lb, formula=formula)` in the notebook (i.e., remove the regularization) and you will see that essentially only the adversarial metrics change.
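
To make the two-stage fit in point 1 concrete, here is a rough, self-contained sketch. It is not the notebook's actual code; the column names and the use of scikit-learn are my own simplifications:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical long-format annotation table, one row per (model, instruction) pair:
#   preference:  1 if the model's output was preferred over the baseline's, else 0
#   tanh_dlen:   tanh(delta_len / std_delta_len), the standardized length feature
#   model, instruction: identifiers

def fit_instruction_difficulty(df: pd.DataFrame) -> pd.Series:
    """Stage 1: fix (psi_m - psi_b) = 1 so that gamma_x enters additively.
    The joint fit over all models is then an ordinary (convex) logistic
    regression with per-model intercepts (theta), per-model length slopes
    (phi), and one shared dummy per instruction whose coefficient is gamma_x."""
    X = pd.concat(
        [
            pd.get_dummies(df["model"], prefix="theta").astype(float),
            pd.get_dummies(df["model"], prefix="phi").astype(float).mul(df["tanh_dlen"], axis=0),
            pd.get_dummies(df["instruction"], prefix="gamma").astype(float),
        ],
        axis=1,
    )
    # Nearly unpenalized; the tiny L2 penalty also resolves the exact
    # collinearity between the theta dummies and the gamma dummies.
    clf = LogisticRegression(fit_intercept=False, C=1e6, max_iter=10_000)
    clf.fit(X, df["preference"])
    coefs = pd.Series(clf.coef_[0], index=X.columns)
    return coefs.filter(like="gamma_")  # instruction_difficulty, one value per instruction

def fit_single_model(df_m: pd.DataFrame, gamma_x: pd.Series) -> np.ndarray:
    """Stage 2: with gamma_x frozen, each model's (theta, phi, psi_m - psi_b)
    is again a small convex logistic regression, fit independently."""
    gamma = gamma_x.reindex("gamma_" + df_m["instruction"].astype(str)).to_numpy()
    X = np.column_stack([np.ones(len(df_m)), df_m["tanh_dlen"].to_numpy(), gamma])
    clf = LogisticRegression(fit_intercept=False, C=1e6, max_iter=10_000)
    clf.fit(X, df_m["preference"])
    return clf.coef_[0]  # [theta_mb, phi_mb, psi_m - psi_b]
```

The actual notebook's `disjoint_optimization_` handles more (e.g. the `formula` and the `regularize_to_baseline_lambda` regularization from point 3), but the convexity argument is the same: each stage is a plain logistic regression.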
yix8 commented 1 month ago

That all makes sense, thank you very much!