SVD Decomp / Explore ways to use dimensionality reduction to quickly understand what heads are doing.

jbloomAus / DecisionTransformerInterpretability

Interpreting how transformers simulate agents performing RL tasks

MIT License

73 stars, 16 forks

This post is awesome. I think the value of this method comes from understanding the method itself better, from understanding our models better, and possibly from editing them too.

"Highly interpretable semantic clusters" sounds very cool. "Directly editing SVD representations" sounds very cool too.

Steps:

[ ] Add an SVD component to the static viz.
[ ] Plot the top-k tokens per SVD direction.
[ ] Plot the singular values per head.
[ ] Evaluate how useful these plots are.
[ ] Look for singular directions with high cosine similarity, somewhat like composition?
[ ] Look at directly editing the SVD decomposition.
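
The steps above can be sketched roughly as follows. This is a minimal NumPy illustration on random weights, not code from this repo. The function names, the weight shapes, and the row-vector convention (`x @ W_V @ W_O`) are all assumptions. It also assumes that the quantity of interest is a head's OV matrix and that singular directions are read off through an unembedding matrix `W_U`.

```python
import numpy as np


def head_ov_svd(W_V, W_O):
    """SVD of one head's OV matrix (assumed shapes: W_V (d_model, d_head), W_O (d_head, d_model))."""
    W_OV = W_V @ W_O  # (d_model, d_model), rank at most d_head
    U, S, Vh = np.linalg.svd(W_OV, full_matrices=False)
    d_head = W_V.shape[1]
    # Keep only the first d_head components; the rest are numerically zero.
    return U[:, :d_head], S[:d_head], Vh[:d_head]


def topk_tokens_per_direction(Vh, W_U, k=3):
    """Top-k output token indices for each right singular vector (row of Vh).

    Singular vectors are only defined up to a joint sign flip of (u_i, v_i),
    so it can be worth inspecting the top-k of -Vh as well.
    """
    logits = Vh @ W_U  # (n_directions, d_vocab)
    return np.argsort(-logits, axis=1)[:, :k]


def direction_cosine_similarity(Vh_a, Vh_b):
    """Pairwise cosine similarity between two sets of singular directions."""
    a = Vh_a / np.linalg.norm(Vh_a, axis=1, keepdims=True)
    b = Vh_b / np.linalg.norm(Vh_b, axis=1, keepdims=True)
    return a @ b.T


def edit_ov(U, S, Vh, drop):
    """Rebuild the OV matrix with the singular values listed in `drop` set to zero."""
    S_edit = S.copy()
    S_edit[list(drop)] = 0.0
    return (U * S_edit) @ Vh  # scales column i of U by S_edit[i]


# Demo on random weights (hypothetical sizes, no real model involved).
rng = np.random.default_rng(0)
d_model, d_head, d_vocab = 16, 4, 7
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))
W_U = rng.normal(size=(d_model, d_vocab))

U, S, Vh = head_ov_svd(W_V, W_O)              # S is the "singular values per head" plot data
top_tokens = topk_tokens_per_direction(Vh, W_U, k=3)
self_sim = direction_cosine_similarity(Vh, Vh)  # would be Vh of two different heads in practice
W_OV_edited = edit_ov(U, S, Vh, drop=[0])     # remove the largest singular direction
```

With a real model, `W_V`, `W_O`, and `W_U` would come from the trained weights, and `top_tokens` would index into the output vocabulary (actions, in the case of a decision transformer). An edited matrix like `W_OV_edited` would then need to be written back into the model to test the effect of removing a direction.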

SVD Decomp / Explore ways to use dimensionality reduction to quickly understand what heads are doing. #69