jbloomAus / DecisionTransformerInterpretability

Interpreting how transformers simulate agents performing RL tasks
https://jbloomaus-decisiontransformerinterpretability-app-4edcnc.streamlit.app/
MIT License
68 stars 16 forks

Mega Card: Improve Analysis App in various ways to facilitate better interpretability analysis of the new models #44

Open jbloomAus opened 1 year ago

jbloomAus commented 1 year ago

Analysis features

Static

Composition

Dynamic

Logit Lens

Attention Maps:

Causal

Activation Patching (features)

RTG Scan

Congruence -> If features aren't in superposition, what effect do they have on the predictions?

Renew old features:

SVD Decomp / Explore ways to use dimensionality reduction to quickly understand what heads are doing.

Cache Characterization?

Advanced

Implement Path Patching

Implement AVEC

Several things I feel are missing which are required for exploratory analysis to be more complete:

Several things I feel will be required for falsifying predictions of how the model is working:

jbloomAus commented 1 year ago

Would storing/calculating mean kurtosis of activations be interesting? https://transformer-circuits.pub/2023/privileged-basis/index.html
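A minimal sketch of what "mean kurtosis of activations" could look like, assuming activations arrive as a list of per-token vectors and that kurtosis is taken across the hidden dimension of each token and then averaged. The function names and the plain-list input format are hypothetical and not part of this repo; in practice this would more likely be a one-liner over a cached activation tensor.

```python
from statistics import fmean


def kurtosis(values):
    """Pearson (non-excess) kurtosis: fourth central moment / variance squared.

    A Gaussian gives roughly 3; a few very large coordinates push it much higher.
    """
    mu = fmean(values)
    m2 = fmean((v - mu) ** 2 for v in values)
    m4 = fmean((v - mu) ** 4 for v in values)
    if m2 == 0:
        raise ValueError("kurtosis is undefined for a constant vector")
    return m4 / (m2 ** 2)


def mean_kurtosis(activations):
    """Average the per-token kurtosis over all tokens.

    `activations` is assumed to be a list of vectors, one per token,
    each of length d_model (hypothetical layout for this sketch).
    """
    return fmean(kurtosis(row) for row in activations)


if __name__ == "__main__":
    # Toy data: one spread-out token and one token dominated by a single dimension.
    spread = [-1.0, 1.0, -1.0, 1.0]
    spiky = [0.0] * 15 + [1.0]
    print(mean_kurtosis([spread, spiky]))
```

Storing one such number per layer or hook point would be cheap, and a value far above 3 would hint that some dimensions are behaving like a privileged basis.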

jbloomAus commented 1 year ago

On a whim, I added a basic history visualization. The main issues are:

  1. One-hot encoded observations aren't amenable to visualization by co-opting the grid render method, which makes this difficult. I just rendered the whole state view, but this feels inaccurate/bad.
  2. Indexing is a little messy because of the adjustment, but I think I sorted it out.

I also started on a time embedding dot product visualization but didn't finish it. I'll leave it in for now, though it didn't seem super interesting.

jbloomAus commented 1 year ago

Plot the L2 norm of the residual stream at each layer (gives a sense of how much information is in a layer compared with how much goes into the logits).
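A rough sketch of the quantity to plot, assuming the residual stream is available as one list of per-position vectors for each layer. The function names and the nested-list layout are hypothetical; the real app would presumably read these from a cached forward pass and hand the result to whatever plotting library it already uses.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a single residual stream vector."""
    return math.sqrt(sum(x * x for x in vec))


def residual_norms_by_layer(residual_streams):
    """Mean L2 norm over positions, one value per layer.

    `residual_streams[layer][position]` is assumed to be a vector of
    length d_model (hypothetical layout for this sketch).
    """
    return [
        sum(l2_norm(vec) for vec in layer) / len(layer)
        for layer in residual_streams
    ]


if __name__ == "__main__":
    # Toy example: two layers, two positions each, d_model = 2.
    streams = [
        [[3.0, 4.0], [0.0, 0.0]],
        [[6.0, 8.0], [6.0, 8.0]],
    ]
    for layer, norm in enumerate(residual_norms_by_layer(streams)):
        print(f"layer {layer}: mean L2 norm = {norm:.2f}")
```

Plotting these values against the layer index would show where the norm grows or shrinks through the network.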