This repo contains the code for the paper "Automatic Circuit Finding and Faithfulness" (COLM 2024; arXiv link). The code for manipulating model graphs and performing EAP/EAP-IG is in a submodule, EAP-IG. Make sure to pull the submodule at the commit referenced here, because I've updated the EAP-IG module since releasing this experimental code. After pulling the submodule, install it by running `pip install -e .` in its directory. A Conda environment that can be used to run these experiments is provided in `environment.yml`.
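The setup steps above can be scripted roughly as follows. This is a minimal sketch, not part of the repo. It assumes the submodule is checked out into a directory named `EAP-IG` at the repo root. It also assumes you run the script with the Python interpreter of the Conda environment you already created and activated from `environment.yml` (for example with `conda env create -f environment.yml`), so that `sys.executable -m pip` installs into that environment. By default it only prints the commands.

```python
from __future__ import annotations

import subprocess
import sys
from pathlib import Path

# Assumption: the EAP-IG submodule lives in ./EAP-IG relative to the repo root.
SUBMODULE_DIR = Path("EAP-IG")


def setup_commands(submodule_dir: Path = SUBMODULE_DIR) -> list[tuple[list[str], Path | None]]:
    """Return (argv, working directory) pairs for the setup steps, in order."""
    return [
        # Check out EAP-IG at the commit pinned by this repo, not the latest upstream version.
        (["git", "submodule", "update", "--init"], None),
        # Editable install of EAP-IG into the interpreter running this script
        # (assumed to be the Python of the activated Conda environment).
        ([sys.executable, "-m", "pip", "install", "-e", "."], submodule_dir),
    ]


def run_setup(dry_run: bool = True) -> None:
    """Print each step; execute it only when dry_run is False."""
    for argv, cwd in setup_commands():
        print(f"[{cwd or '.'}] {' '.join(argv)}")
        if not dry_run:
            # check=True raises CalledProcessError if a step fails.
            subprocess.run(argv, cwd=cwd, check=True)


if __name__ == "__main__":
    run_setup(dry_run=True)
```

Using `git submodule update --init` checks out the exact commit recorded in this repo, rather than the newest EAP-IG code.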
You can replicate the experiments in the paper as follows:
1. For each task in `['ioi', 'greater-than', 'sva', 'gender-bias', 'fact-retrieval-comma', 'hypernymy-comma']`:
   - Run `pareto.py` to collect results for EAP/EAP-IG. You must specify the model name (`--model`), the task (`--task`), the task metric (`--metric`, either `logit_diff` or `prob_diff`), and the batch size (`--batch_size`). If the model is large, you might instead want to run `pareto_big.py`, which also lets you set the `--eval_batch_size` separately.
   - Run `get_real_values.py` with similar options. This can be slow, because this script performs actual activation patching to obtain the true metric-change values against which you will compare the EAP/EAP-IG estimates and circuits.
2. To generate the figures, run the scripts below; each writes a `.png` or `.pdf` file to the corresponding subfolder of the relevant directory in `results`:
   - Run `results/first_figure/first_figure.py`.
   - Run `results/real_pareto_combined/plot_real_pareto_normalized.py`.
   - Run `results/real_rank/compare_real.py` and then `results/real_rank/plot_overlap.py`.
   - Run `overlap_heatmaps.py` as well as `all-cross-task-faithfulness.py`. Then, run `results/cross-task/make_all_heatmaps.py`; if you want recall heatmaps, as in Appendix I, run `results/cross-task/recall_heatmaps.py`.
   - Run `results/manual_overlap/ioi_overlap.py` and `results/manual_overlap/greater_overlap.py`.
   - Run `test_ig_iterations.py`, taking care to specify the `--model` and `--task` of interest, as well as the `--attribution_metric` and the `--eval_metric`. You can then generate the figure using `results/ig_test/plot_ig_test.py`.
   - Run `results/real_rank/plot_kendall.py`.
   - Run `results/real_rank/node_edge_overlap_plot.py`.
   - Run `results/pareto/plot_pareto_normalized_single.py`.
   - Run `results/real_pareto_combined/plot_real_pareto_normalized.py`.

The data used in this paper is available in the `data` folder.
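The per-task runs described above (`pareto.py` or `pareto_big.py`, followed by `get_real_values.py`, for each task) can be scripted with a small driver like the one below. This is a hypothetical helper, not part of the repo, and it makes several assumptions:

- It is run from the repo root.
- `get_real_values.py` accepts the same four flags as `pareto.py`. The README only says it takes "similar options".
- The model name and batch size in `__main__` are placeholders.

By default it prints the commands rather than running them.

```python
from __future__ import annotations

import subprocess
import sys
from typing import Sequence

# Task names exactly as listed in this README.
TASKS = ["ioi", "greater-than", "sva", "gender-bias", "fact-retrieval-comma", "hypernymy-comma"]
METRICS = ("logit_diff", "prob_diff")


def task_commands(
    model: str,
    metric: str,
    batch_size: int,
    tasks: Sequence[str] = TASKS,
    big_model: bool = False,
    eval_batch_size: int | None = None,
) -> list[list[str]]:
    """Build the pareto and get_real_values.py command lines for each task."""
    if metric not in METRICS:
        raise ValueError(f"metric must be one of {METRICS}, got {metric!r}")
    # The README suggests pareto_big.py for large models.
    script = "pareto_big.py" if big_model else "pareto.py"
    commands = []
    for task in tasks:
        flags = ["--model", model, "--task", task,
                 "--metric", metric, "--batch_size", str(batch_size)]
        pareto = [sys.executable, script, *flags]
        # Only pareto_big.py exposes a separate evaluation batch size.
        if big_model and eval_batch_size is not None:
            pareto += ["--eval_batch_size", str(eval_batch_size)]
        commands.append(pareto)
        # Assumption: get_real_values.py accepts the same four flags.
        commands.append([sys.executable, "get_real_values.py", *flags])
    return commands


def run_all(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command; run it, stopping at the first failure, when dry_run is False."""
    for argv in commands:
        print(" ".join(argv[1:]))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # "MODEL_NAME" and batch_size=16 are placeholders, not values from the paper.
    run_all(task_commands("MODEL_NAME", "logit_diff", batch_size=16), dry_run=True)
```

Set `dry_run=False` to execute the commands. Because `get_real_values.py` is the slow step, you may prefer to run the `pareto.py` commands for all tasks first and then launch the activation-patching runs separately.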