openai / automated-interpretability

940 stars 110 forks source link

Rewrite new simulator to use JSON mode; additional fixes to new simulator #34

Open hijohnnylin opened 8 months ago

hijohnnylin commented 8 months ago

This rewrites the new chat-based simulator to use JSON mode. New prompts and parsers were added to do this, and a new flag for Api_client json_mode. It works pretty well - in my testing it performed much more accurately than non-JSON mode, to the point where we are able to use gpt-3.5-turbo-1106 instead of gpt-4, resulting in massive cost and time savings. GPT4 took about 30 seconds, gpt-3.5-turbo-1106 takes about 10 seconds. JSON mode also eliminates the need for many of the response parsing edge cases.

This also sets temperature = 0 as originally intended by the documentation.

This pull request also fixes other edge cases, some of which apply to JSON mode as well:

1) GPT's response, which is a list of tokens and activations, often omits the space before tokens (seen in about ~40% of results). Currently the response parser considers this an invalid response and returns with zero activations for all tokens. This PR allows the first token to be missing the space and still be considered valid.

2) New simulator uses special character ༗\n as a unique separator between lines (since \n is too common). However, GPT sometimes (~5% of the time) doesn't return ༗\n and only returns \n. This causes the response parser to consider this an invalid response. This PR allows \n to be the separator in the case that ༗\n doesn't work. However, if an activation text encounters this edge case and also has \n, this fix won't work either. Better fix in the future is to re-query GPT as a followup e.g., ("it looks like you didn't include the ༗\n separator i originally included. can you try again?")

3) <|endoftext|> token in activation texts confuses the new chat-based simulator. Fix is to replace these occurrences with <|not_endoftext|>.

4) GPT sometimes gives a non-int activation (like 9.5 - it's never told that it needs to be an int). Since this allows more granularity it makes sense to allow it, so this PR changes int to float and enforces a value of 0 to 10 inclusive. Everything else is considered 0.

Some of these are fairly opinionated fixes, so feel free to exclude or alter them in any way you see fit.