Description:
Currently, we assume model weights are frozen when training interventions for alignment. We could also add support in this library for tuning the model's weights together with the intervention.
This would help reproduce the interchange intervention training experiments in this paper. It could also be used to reproduce the causal proxy model experiments (i.e., using a separate explainer model to explain a black-box model).
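
To make the request concrete, here is a minimal sketch of the difference between the two modes. It is a toy in plain Python with hand-written gradients, not this library's API: `init_params`, `counterfactual_output`, `train`, `mean_loss`, and the `freeze_model` flag are all hypothetical names invented for illustration. The toy "model" has two hidden units; the intervention swaps the first hidden unit from a source input into the base run and rescales it with one learnable parameter `a`. The counterfactual label comes from an assumed high-level model `y = x1 + x2`.

```python
import random


def init_params():
    # Toy model weights (u1, v1, u2, v2) plus one intervention parameter (a).
    return {"u1": 0.5, "v1": 0.5, "u2": 0.5, "v2": 0.5, "a": 1.0}


def counterfactual_output(p, base, source):
    # Interchange intervention: hidden unit 1 is taken from the source run
    # (scaled by the learnable intervention parameter a); hidden unit 2
    # is kept from the base run.
    h1_source = p["u1"] * source[0]
    h2_base = p["u2"] * base[1]
    return p["v1"] * (p["a"] * h1_source) + p["v2"] * h2_base


def train(freeze_model, steps=2000, lr=0.05, seed=0):
    rng = random.Random(seed)
    p = init_params()
    for _ in range(steps):
        base = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        source = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        # Counterfactual label from the assumed high-level model y = x1 + x2,
        # with x1 swapped in from the source input.
        target = source[0] + base[1]
        err = counterfactual_output(p, base, source) - target
        # Gradients of the squared error, all computed before any update.
        grads = {"a": 2 * err * p["v1"] * p["u1"] * source[0]}
        if not freeze_model:
            # Proposed mode: model weights are tuned with the intervention.
            grads["u1"] = 2 * err * p["v1"] * p["a"] * source[0]
            grads["v1"] = 2 * err * p["a"] * p["u1"] * source[0]
            grads["u2"] = 2 * err * p["v2"] * base[1]
            grads["v2"] = 2 * err * p["u2"] * base[1]
        for name, grad in grads.items():
            p[name] -= lr * grad
    return p


def mean_loss(p, n=200, seed=1):
    # Mean squared counterfactual error on a fixed evaluation sample.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        base = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        source = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        target = source[0] + base[1]
        total += (counterfactual_output(p, base, source) - target) ** 2
    return total / n


if __name__ == "__main__":
    for freeze in (True, False):
        params = train(freeze_model=freeze)
        print("freeze_model =", freeze, "loss =", mean_loss(params))
```

With `freeze_model=True` only `a` is updated, which mirrors the current assumption. With `freeze_model=False` the model weights receive gradients from the same counterfactual loss, which is the behavior this issue asks for. In this toy the frozen run cannot correct the untouched pathway (`u2`, `v2`), so its loss should stay higher than the jointly tuned run.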