Description

Implements a simple SAE method that folds the W_dec norms of Anthropic-style SAEs into the encoder, so that the W_dec feature directions have unit norm. See the Anthropic update here: https://transformer-circuits.pub/2024/april-update/index.html#training-saes

I have tested that the feature activations and the SAE output are as expected after folding. It's possible we should make this the default when loading from pretrained, so that feature activations are conceptually what you expect them to be. I may submit a PR soon that makes this the case.
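
For reviewers who want the idea in isolation, here is a minimal NumPy sketch of the folding step. It is not the code in this PR. It assumes a standard ReLU SAE with `W_enc` of shape `(d_in, d_sae)`, `W_dec` of shape `(d_sae, d_in)`, and strictly positive decoder row norms. The names `fold_w_dec_norm` and `sae_forward` are hypothetical and exist only for this illustration.

```python
import numpy as np


def fold_w_dec_norm(W_enc, b_enc, W_dec):
    """Fold decoder row norms into the encoder (illustrative sketch only).

    Assumes W_enc: (d_in, d_sae), b_enc: (d_sae,), W_dec: (d_sae, d_in),
    and that every decoder row has a non-zero norm.
    """
    norms = np.linalg.norm(W_dec, axis=1)      # one norm per feature, shape (d_sae,)
    W_dec_folded = W_dec / norms[:, None]      # each decoder row now has unit norm
    W_enc_folded = W_enc * norms[None, :]      # scale each encoder column by its norm
    b_enc_folded = b_enc * norms               # the encoder bias is scaled the same way
    return W_enc_folded, b_enc_folded, W_dec_folded, norms


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Plain ReLU SAE forward pass: returns (feature activations, reconstruction)."""
    acts = np.maximum(x @ W_enc + b_enc, 0.0)
    return acts, acts @ W_dec + b_dec


# Random toy weights; the sizes are arbitrary.
rng = np.random.default_rng(0)
d_in, d_sae = 8, 32
W_enc = rng.normal(size=(d_in, d_sae))
b_enc = rng.normal(size=d_sae)
W_dec = rng.normal(size=(d_sae, d_in))
b_dec = rng.normal(size=d_in)
x = rng.normal(size=(4, d_in))

acts, out = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
W_enc_f, b_enc_f, W_dec_f, norms = fold_w_dec_norm(W_enc, b_enc, W_dec)
acts_f, out_f = sae_forward(x, W_enc_f, b_enc_f, W_dec_f, b_dec)
```

Under these assumptions the reconstruction is unchanged (`out_f` matches `out`), while each feature activation is multiplied by its original decoder norm (`acts_f` matches `acts * norms`). This works because the norms are positive, so scaling the pre-activations commutes with the ReLU.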

Type of change

[ ] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[ ] This change requires a documentation update

Checklist:

[x] I have commented my code, particularly in hard-to-understand areas
[x] I have made corresponding changes to the documentation
[x] My changes generate no new warnings
[x] I have added tests that prove my fix is effective or that my feature works
[x] New and existing unit tests pass locally with my changes
[x] I have not rewritten tests relating to key interfaces which would affect backward compatibility

jbloomAus / SAELens

feat: add w_dec_norm folding #167
