stanfordnlp / pyvene

Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions
http://pyvene.ai
Apache License 2.0
543 stars, 45 forks

[P2] Sparse autoencoders #77

Closed: aryamanarora closed this issue 1 day ago

aryamanarora commented 5 months ago

We should add support for training sparse autoencoders (Bricken et al., 2023; Cunningham et al., 2023). Could be cool as a way of obtaining a feature basis for interventions.
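For concreteness, here is a minimal sketch of the kind of sparse autoencoder being proposed, written in plain PyTorch. It is not pyvene code and not the exact setup from either cited paper: the class name `SparseAutoencoder`, the dimensions, the `l1_coeff` value, and the random stand-in activations are all assumptions made for illustration. The rows of the decoder weight would be the candidate feature basis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    """Hypothetical overcomplete autoencoder with a ReLU code and an L1 sparsity penalty."""

    def __init__(self, d_model: int, d_hidden: int, l1_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)
        self.l1_coeff = l1_coeff  # assumed value; needs tuning in practice

    def forward(self, x: torch.Tensor):
        # Non-negative feature activations.
        features = F.relu(self.encoder(x))
        # Reconstruct the original activation from the features.
        recon = self.decoder(features)
        # Reconstruction error plus an L1 penalty that pushes features toward zero.
        recon_loss = F.mse_loss(recon, x)
        sparsity_loss = features.abs().sum(dim=-1).mean()
        loss = recon_loss + self.l1_coeff * sparsity_loss
        return recon, features, loss


torch.manual_seed(0)
sae = SparseAutoencoder(d_model=16, d_hidden=64)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)

# Random stand-in for activations collected from a real model layer.
activations = torch.randn(256, 16)

for _ in range(5):
    recon, features, loss = sae(activations)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In a real run, `activations` would come from a layer of the model under study, and the hidden size would typically be several times the model dimension.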

frankaging commented 5 months ago

@explanare has this set up locally, I think.

smejak commented 5 months ago

What functionality should this PR have to integrate best with the existing pyvene tools? I have an implementation of SAE training working with TransformerLens models (largely inspired by Neel Nanda’s code, but adjusted so that experiments are easier to modify). Currently, the Buffer class collects the activations from a layer with the model.run_with_cache function; I presume there is an equivalent pyvene function that should be used instead? Other than that, I think the rest of the code should already be interoperable with pyvene.
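As a library-agnostic illustration of what such a buffer does, the sketch below collects a layer's outputs with a standard PyTorch forward hook. It is neither the Buffer class mentioned above nor a pyvene or TransformerLens API: `ActivationBuffer` and `toy_model` are hypothetical names, and the toy model stands in for a real transformer.

```python
import torch
import torch.nn as nn


class ActivationBuffer:
    """Hypothetical buffer that stores the outputs of one module across forward passes."""

    def __init__(self, module: nn.Module):
        self.chunks = []
        # register_forward_hook calls self._hook after every forward pass of `module`.
        self.handle = module.register_forward_hook(self._hook)

    def _hook(self, module, inputs, output):
        # Assumes the module returns a single tensor; transformer blocks that
        # return tuples would need the hidden state selected first.
        # Flatten leading dimensions so that each row is one activation vector.
        self.chunks.append(output.detach().reshape(-1, output.shape[-1]).cpu())

    def get(self) -> torch.Tensor:
        # Concatenate everything collected so far into one (n_vectors, d) tensor.
        return torch.cat(self.chunks, dim=0)

    def close(self) -> None:
        # Remove the hook so later forward passes are no longer recorded.
        self.handle.remove()


# Toy stand-in for a real model; the ReLU output plays the role of a layer activation.
toy_model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
buffer = ActivationBuffer(toy_model[1])

with torch.no_grad():
    for _ in range(3):
        toy_model(torch.randn(5, 8))

buffer.close()
acts = buffer.get()  # 3 batches of 5 vectors, each of width 16
```

The resulting `acts` tensor is what an SAE training loop would consume; swapping the hook for a library's own collection mechanism should only change how `chunks` gets filled.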

frankaging commented 5 months ago

@smejak Currently, pyvene supports activation collection, as in the first two examples in https://github.com/stanfordnlp/pyvene/blob/main/pyvene_101.ipynb. Will this help?