Add an AutoencoderLayer and an AutoencoderIntervention to support interpretability methods that use autoencoders, including Sparse Autoencoders, to learn an interpretable feature space.
The AutoencoderLayer defines any autoencoder with a single-layer encoder and a single-layer decoder. Users can additionally define customized autoencoders by extending the base class AutoencoderLayerBase.
The AutoencoderIntervention defines an intervention that allows interchange interventions in the latent space of the autoencoder.
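For reviewers unfamiliar with the idea, here is a minimal NumPy sketch of an interchange intervention in an autoencoder's latent space. It is not the code added in this PR: the names `encode`, `decode`, and `latent_interchange`, the ReLU encoder, and the random untrained weights are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, latent_dim = 8, 4

# Hypothetical single-layer encoder and decoder weights (random, untrained).
W_enc = rng.normal(size=(embed_dim, latent_dim))
b_enc = np.zeros(latent_dim)
W_dec = rng.normal(size=(latent_dim, embed_dim))
b_dec = np.zeros(embed_dim)


def encode(x):
    """Single-layer encoder; the ReLU nonlinearity is an assumption."""
    return np.maximum(x @ W_enc + b_enc, 0.0)


def decode(z):
    """Single-layer linear decoder back to the embedding space."""
    return z @ W_dec + b_dec


def latent_interchange(base, source, dims):
    """Overwrite the chosen latent dimensions of `base` with those of `source`."""
    z_base, z_source = encode(base), encode(source)
    z_mixed = z_base.copy()
    z_mixed[..., dims] = z_source[..., dims]
    return decode(z_mixed)


base = rng.normal(size=(2, embed_dim))
source = rng.normal(size=(2, embed_dim))
patched = latent_interchange(base, source, dims=[0, 2])
```

In this sketch the patched representation is the decoded mixed latent, so it also carries the autoencoder's reconstruction error. Swapping every latent dimension reduces to reconstructing the source representation.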
The AutoencoderIntervention supports loading pre-trained autoencoders that were trained outside the pyvene framework, as in the get_intervenable_with_autoencoder function below:
import pyvene as pv


def get_intervenable_with_autoencoder(
        model, autoencoder, intervention_dimensions, layer):
    intervention = pv.AutoencoderIntervention(
        embed_dim=autoencoder.input_dim,
        latent_dim=autoencoder.latent_dim)
    # Copy the weights of the pre-trained autoencoder.
    intervention.autoencoder.load_state_dict(autoencoder.state_dict())
    # Select which latent dimensions are swapped between base and source.
    intervention.set_interchange_dim(intervention_dimensions)
    inv_config = pv.IntervenableConfig(
        model_type=type(model),
        representations=[
            pv.RepresentationConfig(
                layer,           # layer
                "block_output",  # intervention repr
                "pos",           # intervention unit
                1,               # max number of unit
                intervention=intervention,
                latent_dim=autoencoder.latent_dim)
        ],
        intervention_types=pv.AutoencoderIntervention,
    )
    intervenable = pv.IntervenableModel(inv_config, model)
    intervenable.set_device("cuda")
    # Freeze the wrapped model's parameters.
    intervenable.disable_model_gradients()
    return intervenable
The resulting intervenable, including the intervention dimensions and the autoencoder, can be saved as:
Fix #77
Testing Done
[internal only] https://colab.research.google.com/drive/1_fxM7JUqkMy6Erz6K1JV0NwQBw1r8g0k?usp=sharing
Will add this colab as a tutorial.