jbloomAus / DecisionTransformerInterpretability

Interpreting how transformers simulate agents performing RL tasks
https://jbloomaus-decisiontransformerinterpretability-app-4edcnc.streamlit.app/
MIT License

Write a post before EAG London #74

Closed by jbloomAus 1 year ago

jbloomAus commented 1 year ago

I have some drafts, but want to make sure I get something out before EAG London / preferably no later than tomorrow night (May 16th).

Key things

Current draft

Resources:

jbloomAus commented 1 year ago

Ideas for framing/emphasis (possibly phrased as post titles):

Trying to understand meaningful questions in

Short term:

Medium:

jbloomAus commented 1 year ago

Discussion with Jay led to deciding to go very simple and direct.

First pass at intro: Decision transformers are analogous to large language models, but it is easier to ask questions about their goals than it is to ask about the goals of large language models. I've built a system to train these sorts of models and to try to interpret them. Training these agents presents a number of challenges, which we have partially overcome in order to produce the model we analyse below, and which we expect to keep addressing so we can train more interesting models in the future. Our analysis makes use of many previously published techniques as well as "live analysis". We uncover a number of interesting behaviors which we attempt to understand. We are particularly excited about the possibility of further studying goal representations as well as agent simulation.
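For readers unfamiliar with the setup, a rough sketch of the framing decision transformers use may help: RL is recast as sequence modelling, where each timestep contributes a (return-to-go, state, action) token triple and the model predicts actions conditioned on the desired return. The helpers below are hypothetical illustrations of that data layout, not functions from this repository.

```python
# Sketch of the (return-to-go, state, action) sequence format used by
# decision transformers. These helpers are illustrative only and are
# not part of the DecisionTransformerInterpretability codebase.

def returns_to_go(rewards):
    """Suffix sums of the reward sequence: RTG_t = r_t + r_{t+1} + ... + r_T."""
    rtg, total = [], 0.0
    for r in reversed(rewards):
        total += r
        rtg.append(total)
    return list(reversed(rtg))

def interleave_trajectory(states, actions, rewards):
    """Flatten a trajectory into the token order a decision transformer
    consumes: RTG_0, s_0, a_0, RTG_1, s_1, a_1, ..."""
    tokens = []
    for rtg, s, a in zip(returns_to_go(rewards), states, actions):
        tokens.extend([("rtg", rtg), ("state", s), ("action", a)])
    return tokens
```

At inference time one conditions on a high initial return-to-go, which is part of why questions about the model's "goals" are easier to pose here than for a language model.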