iclr-blogposts / 2023

DO NOT FORK OR MAKE PRS TO THIS REPO! Please do this with the staging repo: https://github.com/iclr-blogposts/staging
https://iclr-blogposts.github.io/2023/
MIT License

blog/2023/sparse-autoencoders-for-interpretable-rlhf/ #16

Open utterances-bot opened 9 months ago

utterances-bot commented 9 months ago

Sparse Autoencoders for a More Interpretable RLHF | Naomi Bashkansky

Extending Anthropic's recent monosemanticity results toward a new, more interpretable way to fine-tune.

https://naomibashkansky.com/blog/2023/sparse-autoencoders-for-interpretable-rlhf/

Butanium commented 9 months ago

Hi, nice blog post, thanks for sharing it! Just wanted to warn you that your HF and wandb API keys are still visible in the training Colab you linked. Also, could you make the wandb report public? It would be helpful for checking how much compute you needed. Finally, there is an image missing for this description:

The second most frequent feature (feature index ...) in the Pythia 6.9B sparse autoencoder activates on the token "·the".