
Mechanistic Interpretability Challenge

(SOLVED) Challenge 1, MNIST CNN:

Solution report: Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1

Use mechanistic interpretability tools to reverse engineer an MNIST CNN and send me a program for the labeling function it was trained on.

Hint 1: The labels are binary.

Hint 2: The network gets 95.58% accuracy on the test set.

Hint 3: The labeling function can be described in words in one sentence.

Hint 4: This image may be helpful.

(Image: MNIST example)

MNIST CNN challenge: MNIST CNN challenge -- Colab
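
For concreteness, here is a minimal sketch of what a Challenge 1 submission might look like: a binary labeling function over MNIST images plus a check of how often it agrees with the trained CNN on the test set. The rule inside `candidate_label` is a dummy placeholder (not the answer), and `model`, its assumed two-logit output head, and the preprocessing are stand-ins for whatever the Colab notebook actually uses.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms


def candidate_label(image: torch.Tensor) -> int:
    """Hypothesized binary labeling function for a single MNIST image.

    The rule below is a dummy placeholder used only to illustrate the
    expected interface (image in, 0/1 out); it is NOT the challenge answer.
    """
    return int(image.mean().item() > 0.13)  # dummy rule: overall brightness


def agreement_with_model(model: torch.nn.Module, device: str = "cpu") -> float:
    """Fraction of MNIST test images on which candidate_label matches the
    binary label predicted by the challenge CNN (assumed to output two logits)."""
    test_set = datasets.MNIST(root="./data", train=False, download=True,
                              transform=transforms.ToTensor())
    loader = DataLoader(test_set, batch_size=256)
    model = model.to(device).eval()
    matches, total = 0, 0
    with torch.no_grad():
        for images, _ in loader:
            preds = model(images.to(device)).argmax(dim=1).cpu()
            for image, pred in zip(images, preds):
                matches += int(candidate_label(image) == int(pred))
                total += 1
    return matches / total
```

If a candidate function really is the labeling function the network was trained on, its agreement with the model should land near the 95.58% test accuracy from Hint 2.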

(Solved*) Challenge 2, Transformer:

*The challenge was not solved by finding the labeling function but instead by showing that finding the labeling function is very unlikely to be tractable. The report linked below quotes some of my thoughts on this.

Solution report: Solving the Mechanistic Interpretability challenges: EIS VII Challenge 2

See also https://www.lesswrong.com/posts/vGCWzxP8ccAfqsrS3/thoughts-about-the-mechanistic-interpretability-challenge-2

Use mechanistic interpretability tools to reverse engineer a transformer and send me a program for the labeling function it was trained on.

Hint 1: The labels are binary.

Hint 2: The network is trained on 50% of examples and gets 97.27% accuracy on the test half.

Hint 3: Here are the ground truth and learned labels. Notice how the mistakes the network makes are all near curvy parts of the decision boundary...

(Image: ground truth vs. learned labels)

Transformer challenge: Transformer challenge -- Colab
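
As with Challenge 1, here is a minimal sketch of the expected deliverable, under the assumption (suggested by the 2D decision-boundary plot in Hint 3) that each input to the transformer is a pair of integers. The rule in `candidate_label` is a dummy placeholder, and the grid size is left as a parameter because the actual input range should be taken from the Colab notebook.

```python
# Submission stub for Challenge 2, assuming inputs are integer pairs (x, y).
# The rule below is a dummy placeholder, not the reverse-engineered function.

def candidate_label(x: int, y: int) -> int:
    """Hypothesized binary label for the input pair (x, y)."""
    return int(x > y)  # dummy placeholder rule


def label_grid(size: int) -> list[list[int]]:
    """Evaluate candidate_label over a size x size grid of inputs so the
    result can be compared visually against the ground-truth / learned-label
    plot from Hint 3. `size` should match the input range used in the notebook.
    """
    return [[candidate_label(x, y) for y in range(size)] for x in range(size)]
```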

Rewards:

If you send me code for one of the two labeling functions along with a justified mechanistic interpretability explanation for it (e.g., in the form of a Colab notebook), the prize is a $750 donation to a high-impact charity of your choice, for a total prize pool of $1,500 across both challenges. Thanks to Neel Nanda for contributing $500!