interpreting-rl-behavior / interpreting-rl-behavior.github.io

Code for the site https://interpreting-rl-behavior.github.io/

Dataset examples identifier script #64

Closed leesharkey closed 2 years ago

leesharkey commented 2 years ago

Dataset examples are samples from the dataset where a neuron (or direction) has particularly high or low activation. There's some evidence (Borowski et al. 2020) that they are even more useful for interpretation than generative feature visualisation.

I've found them very useful for interpreting the agent. But so far I've been identifying them manually, which takes a fair bit of time.

It'd be great to have a script that returns a text file/CSV/something else listing, for each IC, the ids of the samples where:

- IC X is high
- IC X is middling
- IC X is low

where each category corresponds to the top 10%, the middle 10%, and the bottom 10% respectively. I'm suggesting 10% here, but maybe another threshold would work better.

It'd be good to have a separate list for when the activation is in the top/middle/bottom 10% on the timestep we're taking the gradient from. This is obviously a subset of the broader list. This separate list would be very useful for telling saliency stories.
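A minimal sketch of how such a script could bucket sample ids by percentile (the array name, shape, and output format here are assumptions for illustration, not the repo's actual code):

```python
# Hypothetical sketch: bucket sample ids by IC activation percentile.
# Assumes `activations` has shape (n_samples, n_ics), holding e.g. the max
# activation of each IC per sample (or the activation at the gradient timestep).
import numpy as np
import pandas as pd

def bucket_samples(activations, frac=0.10):
    """Return, per IC, the sample ids in the top / middle / bottom `frac`."""
    n_samples, n_ics = activations.shape
    k = max(1, int(frac * n_samples))
    mid_lo = (n_samples - k) // 2
    records = []
    for ic in range(n_ics):
        order = np.argsort(activations[:, ic])           # ascending by activation
        records.append({
            "ic": ic,
            "low": order[:k].tolist(),                    # bottom frac
            "middle": order[mid_lo:mid_lo + k].tolist(),  # middle frac
            "high": order[-k:].tolist(),                  # top frac
        })
    return pd.DataFrame(records)

# e.g. bucket_samples(acts_all).to_csv("ic_extrema_examples.csv", index=False)
# Running the same function on activations taken only at the gradient timestep
# would give the separate, stricter list described above.
```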

As a sanity check, it'd also be nice (but not essential) to plot histograms of the activations across the samples in the dataset. It's a sanity check because it lets us determine whether 10% (or some other value) is a reasonable threshold. If, for instance, 5% of activations are very high but 95% are middle or low, then a 10% threshold will include many samples where the activation isn't actually very high. It'd also just be nice to get a picture of the distributions of the activations for different ICs. But it's not essential.
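A hedged sketch of that histogram sanity check, again assuming the activations are already gathered into a (samples × ICs) array; the 90th-percentile line just marks where a top-10% cutoff would fall:

```python
# Hypothetical sketch: histogram of each IC's activations across the dataset,
# to check whether a 10% cutoff actually isolates the high-activation tail.
import numpy as np
import matplotlib.pyplot as plt

def plot_ic_histograms(activations, ic_indices, bins=50):
    fig, axes = plt.subplots(len(ic_indices), 1,
                             figsize=(6, 2.5 * len(ic_indices)))
    for ax, ic in zip(np.atleast_1d(axes), ic_indices):
        ax.hist(activations[:, ic], bins=bins)
        ax.axvline(np.quantile(activations[:, ic], 0.9), color="r", ls="--",
                   label="90th percentile")
        ax.set_title(f"IC {ic}")
        ax.legend()
    fig.tight_layout()
    fig.savefig("ic_activation_hists.png")
```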

danbraunai commented 2 years ago

Added code for storing extrema examples in commit 5255038. Code that creates the histograms is in commit f62774b in train-procgen-pytorch.

Overview of the current implementation is as follows:

The outputs can be found in commit 6467408d in this repo.