jbloomAus / SAELens

Training Sparse Autoencoders on Language Models
https://jbloomaus.github.io/SAELens/
MIT License

attempt to train sae for othello-gpt model #177

Closed thijmennijdam closed 3 months ago

thijmennijdam commented 3 months ago

Summary

Added two notebook files:

own_othello_gpt.ipynb: In this notebook, I train and use my own model, which I added to the official model names list in TransformerLens. However, I encounter an error because the model does not seem to be in the right HuggingFace format.

baidicoot_othello_gpt.ipynb: In this notebook, I use the model "Baidicoot/Othello-GPT-Transformer-Lens", which is in the official models list but has no tokenizer and therefore raises an error. As a workaround, I commented out a few lines in the SAELens train function. Training now works, but the training times are long, so I have not yet verified that the resulting SAE makes sense. I am unsure whether this approach is correct and want to verify it, since I will need the same workaround if I manage to get my own model into the right format (see the training sketch after this list).
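For reference, here is a minimal sketch of what a training run like this could look like. The class and field names are taken from a recent SAELens release and may differ in other versions, the hyperparameters are illustrative, and the dataset path is a placeholder for a pre-tokenized set of Othello game sequences (not something from this issue). Because the Baidicoot checkpoint ships without a tokenizer, a run like this still hits the tokenizer error discussed above unless those lines are skipped or a dummy tokenizer is attached.

```python
# Illustrative sketch only: training an SAE on Othello-GPT with SAELens.
# API names assume a recent SAELens release and may differ in your version;
# the dataset path is a hypothetical placeholder.
import torch
from sae_lens import LanguageModelSAERunnerConfig, SAETrainingRunner

cfg = LanguageModelSAERunnerConfig(
    model_name="othello-gpt",               # official TransformerLens alias for Baidicoot/Othello-GPT-Transformer-Lens
    hook_name="blocks.6.hook_resid_post",   # residual stream after layer 6 (example choice)
    hook_layer=6,
    d_in=512,                               # Othello-GPT's d_model
    dataset_path="<your-tokenized-othello-games>",  # placeholder: pre-tokenized game sequences
    is_dataset_tokenized=True,
    context_size=59,                        # Othello-GPT's context length (59 moves)
    expansion_factor=16,
    l1_coefficient=5e-4,
    lr=3e-4,
    training_tokens=10_000_000,
    train_batch_size_tokens=4096,
    device="cuda" if torch.cuda.is_available() else "cpu",
)

sae = SAETrainingRunner(cfg).run()
```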

bryce13950 commented 3 months ago

@jbloomAus this may be something that spans both SAELens and TransformerLens. @thijmennijdam sent me a message on Slack today about his project. I asked him to open a PR with the changes he made to get his project running, so that I could look at it further. His original message follows for further context:

I am Thijmen, an MSc AI student from the University of Amsterdam. I am currently working on a mechanistic interpretability project and want to use SAELens on my own Othello-GPT models. I have explored the codebase but am having trouble adding my own models. It seems like the models need to be part of the official list in TransformerLens. I trained my model using TransformerLens, uploaded it to HuggingFace, and added it to the official list. This approach seems to work. However, I encountered errors related to my model not having a tokenizer when training with SAELens. I commented out these lines since I believe a tokenizer isn't necessary for Othello games, given the small vocabulary, which can be hardcoded. With these changes, my SAE seems to be training now. However, I am a bit unsure whether this method is too hacky and whether it will lead to more errors further down the line, or if this is fine. Is this the recommended approach for adding custom GPT models, or is there a better method? I am just familiarizing myself with the repository and would greatly appreciate some confirmation on the approach I am currently taking.
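As an aside, "hardcoding the vocabulary" could look something like the sketch below. It assumes the conventions from the original Othello-GPT interpretability work (an 8x8 board minus the four pre-filled centre squares, with token 0 reserved and tokens 1-60 for the playable squares); the exact mapping used in the project above may differ.

```python
# Sketch of a hardcoded Othello move vocabulary (assumed conventions; verify
# against the tokenization your Othello-GPT checkpoint was trained with).
CENTRE = {27, 28, 35, 36}  # d4, e4, d5, e5 in row-major 8x8 indexing
playable = [sq for sq in range(64) if sq not in CENTRE]

square_to_token = {sq: i + 1 for i, sq in enumerate(playable)}  # tokens 1..60; 0 reserved
token_to_square = {tok: sq for sq, tok in square_to_token.items()}

def encode_game(moves: list[int]) -> list[int]:
    """Map a list of board squares (0-63, centre squares excluded) to token ids."""
    return [square_to_token[m] for m in moves]
```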

Two questions for the SAELens people: First, for a model to be fully compatible from TransformerLens to SAELens, is there a set of requirements, beyond what he found, that we should be verifying to ensure compatibility? Second, is his solution of skipping the tokenizer when it is missing something that should be allowed in SAELens, so that TransformerLens models without tokenizers can be used, or is that going to create more problems downstream?

jbloomAus commented 3 months ago

Sorry for the delay.

Commenting lines out to get it working seems fine. Let me know if you're still stuck on this; you can message me on the OSMI Slack as well. Othello-GPT SAEs can be trained in under 10 minutes with the vanilla training method (circa Towards Monosemanticity). I can try to debug if you are having trouble.
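For readers unfamiliar with the reference, the "vanilla" method mentioned here is roughly the ReLU autoencoder with an L1 sparsity penalty from Towards Monosemanticity. A minimal PyTorch sketch of that objective (illustrative only, not SAELens's actual implementation):

```python
# Minimal sketch of a "vanilla" sparse autoencoder: ReLU encoder, linear decoder,
# MSE reconstruction loss plus an L1 penalty on the hidden activations.
import torch
import torch.nn as nn

class VanillaSAE(nn.Module):
    def __init__(self, d_in: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_in, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_in) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def forward(self, x: torch.Tensor):
        # Encode: sparse feature activations
        feats = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # Decode: reconstruction of the original activations
        x_hat = feats @ self.W_dec + self.b_dec
        return x_hat, feats

def sae_loss(x, x_hat, feats, l1_coeff: float = 1e-3):
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()  # reconstruction (MSE) term
    sparsity = feats.abs().sum(dim=-1).mean()      # L1 penalty on feature activations
    return recon + l1_coeff * sparsity
```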