interpreting-rl-behavior / interpreting-rl-behavior.github.io

Code for the site https://interpreting-rl-behavior.github.io/

Dataset examples identifier script #64

Closed leesharkey closed 2 years ago

leesharkey commented 2 years ago

Dataset examples are samples from the dataset where a neuron (or direction) has particularly high or low activation. There's some evidence (Borowski et al. 2020) that they are even more useful for interpretation than generative feature visualisation.

I've found them very useful for interpreting the agent. But so far I've been identifying them manually, which takes a fair bit of time.

It'd be great to have a script that returns a text file/CSV/something else listing, for each IC, the ids of the samples where:

- IC X is high
- IC X is middling
- IC X is low

where each category corresponds to the top 10%, the middle 10%, and the bottom 10% respectively. I'm suggesting 10% here, but maybe another threshold would work better.

It'd be good to have a separate list for when the activation is in the top/middle/bottom 10% on the timestep we're taking the gradient from. This is obviously a subset of the broader list. This separate list would be very useful for telling saliency stories.
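A minimal sketch of how such a script could bucket sample ids by percentile (the array name, shape, and output format here are assumptions for illustration, not the repo's actual code):

```python
# Hypothetical sketch: bucket sample ids by IC activation percentile.
# Assumes `activations` has shape (n_samples, n_ics), holding e.g. the max
# activation of each IC per sample (or the activation at the gradient timestep).
import numpy as np
import pandas as pd

def bucket_samples(activations, frac=0.10):
    """Return, per IC, the sample ids in the top / middle / bottom `frac`."""
    n_samples, n_ics = activations.shape
    k = max(1, int(frac * n_samples))
    mid_lo = (n_samples - k) // 2
    records = []
    for ic in range(n_ics):
        order = np.argsort(activations[:, ic])           # ascending by activation
        records.append({
            "ic": ic,
            "low": order[:k].tolist(),                    # bottom frac
            "middle": order[mid_lo:mid_lo + k].tolist(),  # middle frac
            "high": order[-k:].tolist(),                  # top frac
        })
    return pd.DataFrame(records)

# e.g. bucket_samples(acts_all).to_csv("ic_extrema_examples.csv", index=False)
# Running the same function on activations taken only at the gradient timestep
# would give the separate, stricter list described above.
```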

As a sanity check, it'd also be nice (but not essential) to plot histograms of the activations across the samples in the dataset. It's a sanity check because it lets us determine whether 10% (or some other value) is a reasonable threshold. If, for instance, 5% of activations are very high but 95% are middle or low, then a 10% threshold will include many samples where the activation isn't actually very high. It'd also just be nice to get a picture of the distributions of the activations for different ICs. But it's not essential.
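A hedged sketch of that histogram sanity check, again assuming the activations are already gathered into a (samples × ICs) array; the 90th-percentile line just marks where a top-10% cutoff would fall:

```python
# Hypothetical sketch: histogram of each IC's activations across the dataset,
# to check whether a 10% cutoff actually isolates the high-activation tail.
import numpy as np
import matplotlib.pyplot as plt

def plot_ic_histograms(activations, ic_indices, bins=50):
    fig, axes = plt.subplots(len(ic_indices), 1,
                             figsize=(6, 2.5 * len(ic_indices)))
    for ax, ic in zip(np.atleast_1d(axes), ic_indices):
        ax.hist(activations[:, ic], bins=bins)
        ax.axvline(np.quantile(activations[:, ic], 0.9), color="r", ls="--",
                   label="90th percentile")
        ax.set_title(f"IC {ic}")
        ax.legend()
    fig.tight_layout()
    fig.savefig("ic_activation_hists.png")
```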

danbraunai commented 2 years ago

Added code for storing extrema examples in commit 5255038. Code that creates the histograms is in commit f62774b in train-procgen-pytorch.

Overview of the current implementation is as follows:

The outputs can be found in commit 6467408d in this repo.