zdaiot opened 1 week ago
@ianand Hello, I would like to ask about the datasets used for training SAEs versus those used for analyzing SAEs. Are there any particular considerations? Do they need to be the same?

For instance, why is HuggingFaceFW/fineweb used when training jbloom/Gemma-2b-Residual-Stream-SAEs, and apollo-research/roneneldan-TinyStories-tokenizer-gpt2 used when training tiny-stories-1L-21M? Similarly, Skylion007/openwebtext is used for training gpt2-small-res-jb, but NeelNanda/pile-10k is used for analyzing gpt2-small-res-jb. Why is this the case?
@zdaiot Usually we like to use the model's original training data to train the SAE, but we often don't know what that data was. Pre-tokenized datasets are better, and we now have a utility for generating those. It gets more complicated if you want to work with instruction-fine-tuned datasets, and I have no special insights there.
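For reference, pre-tokenization roughly amounts to the following. This is a minimal sketch using plain Hugging Face libraries, not SAELens's own utility (whose API may differ by version); the dataset name, context size, and chunking scheme here are assumptions chosen for illustration:

```python
# Sketch: pre-tokenize a text dataset into fixed-length contexts for SAE training.
# Dataset, context size, and chunking scheme are illustrative assumptions.
from datasets import load_dataset
from transformers import AutoTokenizer

CONTEXT_SIZE = 128  # should match the context_size in your SAE training config

tokenizer = AutoTokenizer.from_pretrained("gpt2")
dataset = load_dataset("NeelNanda/pile-10k", split="train")

def tokenize_and_chunk(batch):
    # Tokenize each document, append EOS as a separator, concatenate,
    # then slice into rows of exactly CONTEXT_SIZE tokens.
    ids = []
    for text in batch["text"]:
        ids.extend(tokenizer(text)["input_ids"] + [tokenizer.eos_token_id])
    n = (len(ids) // CONTEXT_SIZE) * CONTEXT_SIZE  # drop the tail remainder per batch
    return {"input_ids": [ids[i : i + CONTEXT_SIZE] for i in range(0, n, CONTEXT_SIZE)]}

tokenized = dataset.map(
    tokenize_and_chunk,
    batched=True,
    remove_columns=dataset.column_names,
)
tokenized.save_to_disk("./pile-10k-tokenized-gpt2")
```

Rows of fixed-length `input_ids` like this can then be fed to the activation store without tokenizing on the fly, which is why pre-tokenized datasets are preferred.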
@zdaiot let me know if I can close this :)
@jbloomAus May I ask whether the same dataset is needed for training the SAE and for analyzing it?
Thank you for your excellent work. I'd like to know how to reproduce jbloom/Gemma-2b-Residual-Stream-SAEs. Could you open-source the configuration file used to train it?
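Until the exact config is shared, here is a hedged sketch of what an SAELens training config for a Gemma-2b residual-stream SAE could look like. The `LanguageModelSAERunnerConfig` / `SAETrainingRunner` names follow recent SAELens versions and may differ in yours; every hyperparameter value below is a placeholder, not the one actually used for jbloom/Gemma-2b-Residual-Stream-SAEs:

```python
# Sketch only: placeholder hyperparameters for a Gemma-2b residual-stream SAE.
# Field names assume a recent SAELens version and may need adjusting.
from sae_lens import LanguageModelSAERunnerConfig, SAETrainingRunner

cfg = LanguageModelSAERunnerConfig(
    model_name="gemma-2b",
    hook_name="blocks.12.hook_resid_post",  # placeholder layer
    hook_layer=12,
    d_in=2048,                              # Gemma-2b residual stream width
    dataset_path="HuggingFaceFW/fineweb",   # dataset named in this thread
    is_dataset_tokenized=False,
    context_size=1024,                      # placeholder
    expansion_factor=8,                     # placeholder
    l1_coefficient=5e-4,                    # placeholder
    lr=4e-4,                                # placeholder
    training_tokens=300_000_000,            # placeholder
    device="cuda",
)
sae = SAETrainingRunner(cfg).run()
```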