jbloomAus / SAELens

Training Sparse Autoencoders on Language Models

https://jbloomaus.github.io/SAELens/

MIT License

SAE Lens

SAELens exists to help researchers:

- Train sparse autoencoders.
- Analyse sparse autoencoders / research mechanistic interpretability.
- Generate insights which make it easier to create safe and aligned AI systems.

Please refer to the documentation for information on how to:

- Download and analyse pre-trained sparse autoencoders.
- Train your own sparse autoencoders.
- Generate feature dashboards with the SAE-Vis library.

SAE Lens is the result of many contributors working collectively to improve humanity's understanding of neural networks, many of whom are motivated by a desire to safeguard humanity from risks posed by artificial intelligence.

This library is maintained by Joseph Bloom and David Chanin.

Tutorials

- Loading and Analysing Pre-Trained Sparse Autoencoders
- Understanding SAE Features with the Logit Lens
- Training a Sparse Autoencoder
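
For readers new to the topic, the kind of model these tutorials train and analyse can be sketched in a few lines. The block below is a minimal, dependency-free illustration of a generic sparse autoencoder: a ReLU encoder, a linear decoder, and a loss that adds an L1 sparsity penalty to the reconstruction error. It is not the SAELens API. The function names (`sae_forward`, `sae_loss`), the parameter names, and the exact loss form are assumptions chosen for illustration, and real implementations often differ in details such as bias handling and normalisation.

```python
def relu(value):
    """Rectified linear unit: negative inputs are clamped to zero."""
    return value if value > 0.0 else 0.0


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode one activation vector into features, then decode it back.

    Hypothetical shapes: w_enc is (d_sae x d_in), w_dec is (d_in x d_sae).
    """
    pre_activations = matvec(w_enc, x)
    features = [relu(p + b) for p, b in zip(pre_activations, b_enc)]
    reconstruction = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, reconstruction


def sae_loss(x, features, reconstruction, l1_coefficient):
    """Mean squared reconstruction error plus an L1 penalty on the features."""
    mse = sum((a - b) ** 2 for a, b in zip(x, reconstruction)) / len(x)
    l1 = sum(abs(f) for f in features)
    return mse + l1_coefficient * l1


# Toy example with identity weights: the negative input component is
# zeroed by the ReLU, so it cannot be reconstructed.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
x = [1.0, -2.0]
features, reconstruction = sae_forward(x, identity, zeros, identity, zeros)
loss = sae_loss(x, features, reconstruction, l1_coefficient=0.5)
print(features, reconstruction, loss)
```

The L1 term is what pushes most feature activations towards zero. Raising the hypothetical `l1_coefficient` trades reconstruction quality for sparser features.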

Join the Slack!

Feel free to join the Open Source Mechanistic Interpretability Slack for support!

Citations and References

Research:

Reference Implementations: