thestephencasper / mechanistic_interpretability_challenge

Testing accuracy calculation has a bug #1

Open neverix opened 1 year ago

neverix commented 1 year ago

This is great work! I was running it and noticed that the testing accuracy computation code

# class logits for every (x, y) pair, taken from the model output at index -1
logits = model(all_pairs)[:, -1]
model_labels = np.zeros((p, p))
for x in range(p):
    for y in range(p):
        # predict label 1 when the logit for class 1 exceeds the logit for class 0
        if logits[x*p + y][0] < logits[x*p + y][1]:
            model_labels[x][y] = 1
# 0.9727464954185919
acc_on_test_half = 1 - 2 * np.mean(np.abs(ground_truth_labels - model_labels))

print('Accuracy on test half:', acc_on_test_half)

does not actually take into account whether a point is in the training or testing set. Computing it correctly would require something like:

# mark which (x, y) pairs appear in the training set
is_train = np.zeros((p, p))
for x, y, _ in train:
    is_train[x][y] = 1
# 0.975724353954581
# average the error only over points outside the training set
acc_on_test_half = 1 - np.mean(np.abs(ground_truth_labels - model_labels) * (1 - is_train)) / (1 - is_train).mean()
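
For what it's worth, an equivalent way to compute the masked accuracy with boolean indexing (same variables as above):

test_mask = is_train == 0
acc_on_test_half = (model_labels == ground_truth_labels)[test_mask].mean()
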
thestephencasper commented 1 year ago

Thanks, I wrote this under the assumption that the model gets all of the training examples correct and only fails on testing ones. This is not perfect, but I expect the reported accuracy to be either correct or off by just a few pixels in this case.
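
A quick way to check how far off that assumption is, reusing model_labels, ground_truth_labels, and is_train from the snippets above (a sketch, untested):

train_mask = is_train == 1
correct = model_labels == ground_truth_labels
print('Train accuracy:', correct[train_mask].mean())   # ~1.0 if the assumption holds
print('Test accuracy:', correct[~train_mask].mean())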

Good catch. Sadly, I am unlikely to fix this, both for the sake of time and because the difference is small. Good luck w/ the challenge if you're working on it. And let me know if you have any questions.

neverix commented 1 year ago

I'm skeptical about how much can be interpreted from this model because it is not perfectly grokked. It's possible to train a model with a single ReLU unit that gets 96+% accuracy, and that unit relies on just the first two principal components of the data to reach that score.
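
For reference, here is a rough sketch of that kind of baseline. The setup is hypothetical (it assumes "the data" means one-hot encodings of the (x, y) pairs and reuses p and ground_truth_labels from the snippets above); the exact setup may differ:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

# one-hot encode every (x, y) pair as a length-2p vector
X = np.zeros((p * p, 2 * p))
for x in range(p):
    for y in range(p):
        X[x * p + y, x] = 1.0
        X[x * p + y, p + y] = 1.0
labels = ground_truth_labels.reshape(-1)

# project onto the first two principal components, then fit a single ReLU unit
Z = PCA(n_components=2).fit_transform(X)
clf = MLPClassifier(hidden_layer_sizes=(1,), activation='relu', max_iter=2000)
clf.fit(Z, labels)
print('Single-ReLU-unit accuracy:', clf.score(Z, labels))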

thestephencasper commented 1 year ago

Thanks. In the post introducing the challenge, I wrote:

"Neither of the models perfectly label the validation set. One may object that this will make the problem unfairly difficult because if there is no convergence on the same behavior as the actual labeling function, then how is one supposed to find that function inside the model? This is kind of the point though. Real models that real engineers have to work with models don’t tend to conveniently grok onto a simple, elegant, programmatic solution to a problem."

To the extent that it's hard to reverse-engineer these networks because they aren't grokked, it will be hard to do MI on real-world problems. This is part of what I wrote about in EIS VI https://www.alignmentforum.org/s/a6ne2ve5uturEEQK7/p/wt7HXaCWzuKQipqz3: working on toy/cherrypicked problems where grokking can happen is not useful for developing methods that will be useful irl.

neverix commented 1 year ago

Yeah, it depends on whether the model is accurate enough to extract the function - the training set is not perfectly labeled either. Is that why you bet that the full version of the challenge isn't solvable with just this network?

thestephencasper commented 1 year ago

I do not expect it to be unsolvable. But yes, I expect it to be hard.

When you say "if the model is accurate enough to extract the function," this seems to relate to a fundamental problem with some existing approaches to MI. If a method requires a model to have "grokked" in some sense, there is no reason to expect it to be useful in practice. This is related to some thoughts I have on how prior "mechanistic interpretability analysis of grokking" work may offer a harmfully unpragmatic framing of the interpretability problem. Highlighting the difficulty you point out was one of my goals with this sequence. We have no evidence that real models applied to non-toy problems grok in a similar way to what we can see in toy problems where a model gets "accurate enough to extract the function".

So for this problem and any other, I would not advise approaches that require the model to have "grokked" in some sense. I think this problem can be solved. I can imagine a few experiments that would give good evidence of what the labeling program is. But these experiments are not the type of thing done in the "mechanistic interpretability analysis of grokking" work. And this is part of the point of the challenge.

neverix commented 1 year ago

Would you agree that this set of weights is not sufficient to get the exact function generating the data?

thestephencasper commented 1 year ago

Yes. Unarguably. Emphasis on exact.

thestephencasper commented 1 year ago

The network does not implement the exact labeling function, but it implements something close to it. So the challenge is to use mechanistic analysis of the network, in whatever way you would like, to find the general idea of what is being computed and make up the difference to recover the program. If this is just not tractable, that's bad news for circuits-style MI. And that's the point.
