Hi @salemohamedo, thanks for the note!
The "probabilities" listed in these files are actually negative log probabilities. In other words, if x is the value printed, e ** -x retrieves the original probability. It follows that the logic for comparing probs is inverted.
This is done to avoid problems with floating point error; very small probabilities are hard to represent in raw form.
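As a rough illustration (not the exact evaluation code, and the values below are made up), this is what the inverted comparison amounts to:

```python
import math

# The JSON stores negative log probabilities, so SMALLER numbers mean
# MORE likely tokens.
neg_logp_true = 0.12   # hypothetical value for target_true
neg_logp_new = 7.85    # hypothetical value for target_new

# Recover the raw probabilities if you really want them.
p_true = math.exp(-neg_logp_true)
p_new = math.exp(-neg_logp_new)

# Because the sign is flipped, "target_new is more likely than target_true"
# is expressed as neg_logp_new < neg_logp_true, not >.
rewrite_success = neg_logp_new < neg_logp_true
print(p_true, p_new, rewrite_success)
```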
Perhaps we should have made it clearer in the naming convention, sorry! Let me know if you have any other questions.
Gotcha, that makes sense. So I take it we want pre_rewrite_success to be low in that case...
I'm using this code as part of a project to see how model editing techniques perform on distilled models. I made a few tweaks to add support for distilgpt2 and then tested ROME on a subset of CF data that distilgpt2 already predicts the true responses for. I noticed that the post-rewrite success rate dropped considerably, from 99% (gpt2-xl) to 1.15% (distilgpt2). Aside from potential bugs in my code (a likely possibility), any intuition on why ROME might not function as well on smaller models?
So I take it we want pre_rewrite_success to be low in that case...
Yep. But you'll notice that this value is non-negligible in our results table; GPT sometimes guesses the counterfactual correctly, since it doesn't know the original fact.
any intuition on why ROME might not function as well on smaller models?
Hm, seeing low rewrite efficacy is strange. This is probably due to a bug (or miscalibrated hyperparameters, or both), and here's why I say that: the ROME update is partially the result of an optimization loop (see rome/compute_v.py). Gradient descent is quite strong, and it almost always finds a solution where the efficacy is high. But high rewrite efficacy is like high success on the training dataset; it's a sanity check to make sure the update hasn't totally underfit. To evaluate performance more extensively, you'll want to look at other metrics.
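If it helps to double-check your numbers independently, here's a rough sketch of how you could aggregate rewrite efficacy over your per-case result files yourself. The key names ("post", "rewrite_prompts_probs", "target_new", "target_true") and the file layout are assumptions on my part, so adjust them to whatever your result JSONs actually contain:

```python
import json
from pathlib import Path

def rewrite_efficacy(results_dir: str) -> float:
    """Fraction of prompts where target_new is more likely than target_true
    after the edit. Remember: values are negative log probs, so '<' means
    'more likely'."""
    hits, total = 0, 0
    for path in Path(results_dir).glob("*case_*.json"):
        record = json.loads(path.read_text())
        for probs in record["post"]["rewrite_prompts_probs"]:
            hits += probs["target_new"] < probs["target_true"]
            total += 1
    return hits / total if total else 0.0

print(rewrite_efficacy("results/ROME/run_000"))
```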
We haven't experimented extensively with small models because they simply aren't too great off-the-shelf :) But once the rewrite efficacy looks right and you've tuned hyperparameters (in particular, weight decay and learning rate), do keep us posted on what happens!
Note that we've found the update efficacy to be somewhat dependent on the norm of the update. If you're confident that your code is bug-free, maybe try decreasing weight decay (i.e. increasing the norm of the update) to see what happens. There's usually a sweet spot.
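As a concrete (hypothetical) starting point, you could load the shipped gpt2-xl hyperparameters and sweep the weight decay downward; the field names used below (v_lr, v_weight_decay) are assumptions, so check them against the actual JSON in hparams/ROME/ before relying on this:

```python
import json

# Sketch: start from the existing gpt2-xl hyperparameters and write out
# variants with smaller weight decay (i.e. allowing a larger-norm update).
with open("hparams/ROME/gpt2-xl.json") as f:
    hparams = json.load(f)

base_wd = hparams["v_weight_decay"]
for wd in (base_wd, base_wd / 2, base_wd / 4):
    trial = dict(hparams, v_weight_decay=wd)
    out_path = f"hparams/ROME/distilgpt2_wd{wd}.json"
    with open(out_path, "w") as f:
        json.dump(trial, f, indent=2)
    print("wrote", out_path, "with v_weight_decay =", wd)
```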
Got it, thanks very much for the help!
Hi, I've enjoyed playing with ROME and appreciate the interactive colab notebooks! I tried it out myself using gpt2-xl and I'm running into some strange behavior. Below, I've pasted the JSON for one of the case results (756) using ROME. As you can see, the pre-rewrite probability for target_true (Nintendo) is much lower than that of target_new (Apple). Shouldn't it be the other way around? I tried the predict_token method in the causal trace notebook, and before applying ROME, gpt2-xl correctly predicts Nintendo. Additionally, the post-rewrite probs seem to be incorrect as well. Shouldn't the prob of target_new be higher than the prob of target_true after the rewrite? I found the same behavior over the majority of other cases I tested as well (I tested a batch of 350). I'm not sure if I'm misunderstanding something, so just looking to clarify that.

Another question I had is regarding this line of code. Don't we want x["target_true"] > x["target_new"] only to be true for pre and the inverse to be true for post? Any clarification would be appreciated, thanks!