Closed: leesharkey closed this issue 2 years ago
This might tie in nicely with the objective robustness stuff. The difference is that in the objective robustness experiments, you made an educated guess that the agent was actually looking for the end of the level as opposed to the coin. Whereas here, we first investigate what the agent is looking for, and then demonstrate it by changing the env. I'm just mentioning this since it may be nice to write about in this section of the paper if we do these experiments.
We'll be validating our interpretations using ablations and I think that will be enough.
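For concreteness, the kind of ablation check I have in mind looks roughly like this. Everything here is a toy stand-in (the weights, shapes, and feature index are illustrative, not the real agent): we zero out a feature we think we've interpreted and check whether the policy's action logits shift the way our interpretation predicts.

```python
import numpy as np

# Toy stand-in for the agent: one ReLU feature layer feeding action
# logits. Shapes and the ablated index are illustrative assumptions,
# not the real coinrun policy.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 12))   # maps 12-dim observation -> 8 features
W2 = rng.normal(size=(5, 8))    # maps features -> 5 action logits

def forward(obs, ablate=None):
    """Run the toy policy, optionally zeroing one feature (the ablation)."""
    feats = np.maximum(W1 @ obs, 0.0)  # ReLU features
    if ablate is not None:
        feats[ablate] = 0.0            # knock out the interpreted unit
    return W2 @ feats

obs = rng.normal(size=12)
baseline = forward(obs)
ablated = forward(obs, ablate=2)

# If our interpretation of feature 2 is right, ablating it should
# change the action logits in the predicted direction.
print(np.abs(baseline - ablated).max())
```

In the real setup this would be a forward hook on the agent's conv layers rather than hand-rolled matrices, but the logic (run with and without the unit, compare behaviour against the prediction) is the same.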
In Activation Atlases, OpenAI was able to hand-craft adversarial examples for the conv net they were interpreting. This was a powerful demonstration of the capability of their interpretability method. If we can produce something similar, we probably should.
One way to do this might be to 1) deeply understand what's going on inside the agent, then 2) modify the coinrun environment to exploit that understanding, such that it produces unexpected (yet predicted) behaviour in the agent.