Closed · yix8 closed this issue 1 month ago
+1 on these questions.
For question #3 above, it seems that "dfgamed.csv" contains the so-called "gamed" baselines. Does it intend to make the GLM more robust against token length? In other words, question #3 corresponds to the "additional weak regularization on $\phi_{m,b}$" mentioned in the paper, am I right?
Hi @yix8,
Sorry for the delayed response; I've been very busy recently.
Before answering your questions, it's worth noting that I tried fitting instruction_difficulty in many different sensible ways, and it made nearly no difference in the final results. So I see it as a simple but useful feature for the GLM. In particular, none of the properties of LC AlpacaEval depend on how the instruction_difficulty feature is fitted; the properties come from the multiplication by ψ_m - ψ_b. Even using random features would maintain all the desired properties, although the correlation with LMSYS would then decrease from 98% to approximately 95%.
- Here's a notebook for computing instruction_difficulty. Concerning your question "Can this be interpreted as first training the GLM on all models in the leaderboard to estimate it?": not quite, because the (ψ_m - ψ_b) factor is set to 1. Jointly training a multiplicative term makes the problem much harder to optimize (non-convex). So we first fix (ψ_m - ψ_b) and fit instruction_difficulty, and then do the opposite, which makes each stage convex and lets each model be fitted independently.
- There are two ways of computing the score with a different baseline. The first is to use the GLM to predict the preference that the LLM judge would have given, as shown in Figure 5 of the paper. You can do that easily with `--metric_kwargs "{'baseline':'Meta-Llama-3-8B-Instruct'}"`. The second is to actually annotate against a different baseline, i.e., using `--reference_outputs`. In that case you would ideally either refit or remove the instruction_difficulty term, and either remove or change the regularization towards the GPT-4 baseline. Given what I said above about instruction_difficulty simply being a good feature, I would suggest keeping the same instruction_difficulty but removing the regularization with `--metric_kwargs "{'glm_name':'length_controlled_noreg'}"`. To remove both instruction_difficulty and the regularization, use `--metric_kwargs "{'glm_name':'length_controlled_minimal'}"`.
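The first option (predicting, rather than re-annotating, how the judge would score a new baseline) can be sketched as follows. This assumes the same simplified GLM with already-fitted per-model strengths ψ and a per-instruction difficulty feature; the values and function names are illustrative, not AlpacaEval's API.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def predicted_lc_winrate(psi, difficulty, model, baseline):
    """Predicted probability that `model` beats `baseline`, averaged over
    instructions. The length term is omitted: length control corresponds to
    evaluating the GLM at a length difference of zero."""
    delta = psi[model] - psi[baseline]
    return sigmoid(delta * (1.0 + difficulty)).mean()

# Illustrative fitted strengths and a toy difficulty feature.
psi = {"gpt4_turbo": 1.8, "Meta-Llama-3-8B-Instruct": 0.9, "my_model": 1.2}
difficulty = np.linspace(-0.5, 0.5, 100)

wr_vs_gpt4 = predicted_lc_winrate(psi, difficulty, "my_model", "gpt4_turbo")
wr_vs_llama = predicted_lc_winrate(psi, difficulty, "my_model", "Meta-Llama-3-8B-Instruct")
```

The point is that once ψ is fitted per model, swapping the baseline only changes which ψ_b is subtracted, so no re-annotation is needed.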
- As mentioned by @zhouku92, this corresponds to the "additional weak regularization on ϕ_{m,b}". In particular, it regularizes ϕ_{m,b} towards zero. As mentioned in the paper, this has nearly no effect on average-case models, but it avoids adversarial gaming where someone submits a model whose non-preferred outputs are all trimmed down to a few characters while the preferred ones are kept; without regularization, the GLM would learn a very large ϕ for such a model. If you are not building a leaderboard that people might try to game, I would drop it for simplicity. To see its effect, replace
`metrics, models = disjoint_optimization_(lb, df, df_lb, formula=formula, regularize_to_baseline_lambda=0.2)`
with
`metrics, models = disjoint_optimization_(lb, df, df_lb, formula=formula)`
in the notebook (i.e., remove the regularization), and you will see that essentially only the adversarial metrics change.
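The effect of that regularization can be illustrated with a toy convex fit: an L2 penalty pulling the length coefficient ϕ towards zero (the baseline's value) caps the huge ϕ that a length-gamed model would otherwise get. A numpy-only sketch under those simplifying assumptions — not AlpacaEval's actual optimizer, and `lam` only loosely mirrors `regularize_to_baseline_lambda`:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_phi(z_len, y, lam=0.0, lr=0.5, steps=3000):
    """MLE of a single length coefficient phi in P(win) = sigmoid(phi * z_len),
    with an optional L2 penalty lam * phi**2 pulling phi towards 0."""
    phi = 0.0
    for _ in range(steps):
        p = sigmoid(phi * z_len)
        grad = ((y - p) * z_len).mean() - 2.0 * lam * phi  # penalized log-lik gradient
        phi += lr * grad
    return phi

rng = np.random.default_rng(1)
n = 500
# "Gamed" model: it wins exactly when its output is longer
# (z_len = standardized length difference), so the data are separable in length
# and the unpenalized MLE of phi blows up.
z_len = rng.normal(size=n)
y_gamed = (z_len > 0).astype(float)

phi_free = fit_phi(z_len, y_gamed, lam=0.0)   # grows very large
phi_reg = fit_phi(z_len, y_gamed, lam=0.2)    # stays moderate
```

For an honest model, whose wins are only weakly correlated with length, the penalty barely moves ϕ, which is why the leaderboard metrics are essentially unchanged except for the adversarial ones.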
All make sense, thank you very much!
Thank you for this fantastic project. However, after reading the paper and reviewing the GitHub repository, I have the following questions:
Thank you for your time and help.