Open utterances-bot opened 9 months ago
Hi, nice blog, thanks for sharing it! Just wanted to warn you that your Hugging Face and wandb keys are still in the training Colab you linked. Could you make the wandb report public? It would be helpful for checking how much compute you needed. Also, there is an image missing for this description:
The second most frequent feature (feature index ...) in the Pythia 6.9B sparse autoencoder activates on the token "·the".
Sparse Autoencoders for a More Interpretable RLHF | Naomi Bashkansky
Extending Anthropic's recent monosemanticity results toward a new, more interpretable way to fine-tune.
https://naomibashkansky.com/blog/2023/sparse-autoencoders-for-interpretable-rlhf/