stanfordnlp / pyvene

Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions
http://pyvene.ai
Apache License 2.0
609 stars, 59 forks

[P2] Support interchange intervention training by unfreezing model weights with vanilla intervention #24

Closed: frankaging closed this issue 8 months ago

frankaging commented 10 months ago

Description: Currently, we assume model weights are frozen when training interventions for alignment. We could also add support to this library for tuning the model weights together with the intervention.

This would help reproduce the interchange intervention training experiments in this paper. It could also be used to reproduce the causal proxy model experiments (i.e., using a separate explainer model to explain a black-box model).
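To make the request concrete, here is a minimal sketch of what training with unfrozen weights under a vanilla interchange intervention could look like. It uses plain PyTorch forward hooks rather than the pyvene API, and every name in it (`TinyMLP`, `interchange_forward`, `train_with_interchange`, the toy data, and the random counterfactual labels) is a hypothetical stand-in, not something taken from the library or the paper. The vanilla intervention has no parameters of its own, so the only thing the optimizer updates is the model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class TinyMLP(nn.Module):
    """Hypothetical toy model; stands in for any PyTorch model."""

    def __init__(self, d_in=4, d_hidden=8, d_out=2):
        super().__init__()
        self.layer1 = nn.Linear(d_in, d_hidden)
        self.layer2 = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        return self.layer2(torch.relu(self.layer1(x)))


def interchange_forward(model, base, source, dims):
    """Run `base`, with units `dims` of layer1's output swapped in from the `source` run."""
    cache = {}

    def save_hook(module, inputs, output):
        # Keep the autograd graph so gradients also flow through the source run.
        cache["source"] = output

    def swap_hook(module, inputs, output):
        patched = output.clone()
        patched[:, dims] = cache["source"][:, dims]
        return patched  # a returned value replaces the module's output

    handle = model.layer1.register_forward_hook(save_hook)
    try:
        model(source)
    finally:
        handle.remove()

    handle = model.layer1.register_forward_hook(swap_hook)
    try:
        return model(base)
    finally:
        handle.remove()


def train_with_interchange(model, base, source, cf_labels, dims, steps=50, lr=1e-2):
    """Tune the (unfrozen) model weights against counterfactual labels."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    losses = []
    for _ in range(steps):
        optimizer.zero_grad()
        logits = interchange_forward(model, base, source, dims)
        loss = loss_fn(logits, cf_labels)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses


# Toy data: the counterfactual labels are random placeholders (an assumption).
model = TinyMLP()
base = torch.randn(16, 4)
source = torch.randn(16, 4)
cf_labels = torch.randint(0, 2, (16,))
losses = train_with_interchange(model, base, source, cf_labels, dims=[0, 1, 2, 3])
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

In a real setup the counterfactual labels would come from a high-level causal model, and this loss would typically be combined with the ordinary task loss; both are left out here to keep the sketch short.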