openai / automated-interpretability


Fix the values in NEWER_EXAMPLES #39

Closed hijohnnylin closed 8 months ago

hijohnnylin commented 8 months ago

The logprob-free simulator uses FewShotExampleSet.NEWER, the same set that ExplanationTokenByTokenSimulator uses. For scoring v1, we used ExplanationNeuronSimulator, which used the .ORIGINAL examples.

The problem with the .NEWER example set is that it does not assign high enough scores to relevant tokens. (Source)

Example Explanation: "the word “variant” and other words with the same “vari” root"

Based on the example explanation, I would expect the example scores to give the "vari" tokens high values. However, only one token (the first appearance of "Variant") has a score of 4.2; one other has 1.24, and the rest are ~0.

This change updates the "vari-" tokens to have positive values. It also removes some instances of "negative zero" and small decimal values, which seemed to confuse GPT.
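For reviewers, here is a rough sketch of the kind of cleanup described above. It is not the code in this PR (the example values were edited by hand). The helper names, the `epsilon` threshold, and the sample tokens are all hypothetical; only the 4.2 score comes from the existing example.

```python
import math


def has_negative_zero(values):
    """True if any value is IEEE 754 negative zero (-0.0 == 0.0, so check the sign bit)."""
    return any(v == 0 and math.copysign(1.0, v) < 0 for v in values)


def clean_activations(activations, epsilon=0.05):
    """Replace negative zero and near-zero values with a plain 0.0.

    `epsilon` is an assumed cutoff for "small decimal values", not a value from the repo.
    """
    return [0.0 if abs(v) < epsilon else v for v in activations]


def tokens_missing_positive_score(tokens, activations, root="vari"):
    """Indices of tokens containing `root` whose activation is not positive."""
    return [
        i
        for i, (tok, act) in enumerate(zip(tokens, activations))
        if root in tok.lower() and act <= 0
    ]


# Hypothetical sample, not the actual NEWER_EXAMPLES data.
tokens = ["The", " Variant", " of", " various", " kinds"]
activations = [-0.0, 4.2, 0.01, 0.0, -0.0]

cleaned = clean_activations(activations)
missing = tokens_missing_positive_score(tokens, cleaned)
```

In this sample, `missing` points at the " various" token, which matches the explanation but still has no positive score. That is the kind of entry this change updates by hand.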

Changes were tested by Neuronpedia's Score/Prompt Tuner.

hijohnnylin commented 8 months ago

@henktillman I can't seem to add reviewers so I'm tagging you instead.