Closed: leesharkey closed this issue 2 years ago
This might tie in nicely with the objective robustness stuff. The difference is that in the objective robustness experiments, you made an educated guess that the agent was actually looking for the end of the level as opposed to the coin. Whereas here, we first investigate what the agent is looking for, and then demonstrate it by changing the env. I'm just mentioning this since it may be nice to write about in this section of the paper if we do these experiments.
We'll be validating our interpretations using ablations and I think that will be enough.
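For concreteness, the kind of ablation check I have in mind looks roughly like this. Everything here is a toy stand-in (the weights, shapes, and feature index are illustrative, not the real agent): we zero out a feature we think we've interpreted and check whether the policy's action logits shift the way our interpretation predicts.

```python
import numpy as np

# Toy stand-in for the agent: one ReLU feature layer feeding action
# logits. Shapes and the ablated index are illustrative assumptions,
# not the real coinrun policy.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 12))   # maps 12-dim observation -> 8 features
W2 = rng.normal(size=(5, 8))    # maps features -> 5 action logits

def forward(obs, ablate=None):
    """Run the toy policy, optionally zeroing one feature (the ablation)."""
    feats = np.maximum(W1 @ obs, 0.0)  # ReLU features
    if ablate is not None:
        feats[ablate] = 0.0            # knock out the interpreted unit
    return W2 @ feats

obs = rng.normal(size=12)
baseline = forward(obs)
ablated = forward(obs, ablate=2)

# If our interpretation of feature 2 is right, ablating it should
# change the action logits in the predicted direction.
print(np.abs(baseline - ablated).max())
```

In the real setup this would be a forward hook on the agent's conv layers rather than hand-rolled matrices, but the logic (run with and without the unit, compare behaviour against the prediction) is the same.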
In Activation Atlases, OpenAI was able to hand-craft adversarial examples for the conv net they were interpreting. This was a powerful demonstration of the capability of their interpretability method. If we can produce something similar, we probably should.
One way to do this might be to 1) deeply understand what's going on inside the agent, then 2) modify the coinrun environment to exploit that understanding, such that it produces unexpected (yet predicted) behaviour in the agent.