jbloomAus / SAELens

Training Sparse Autoencoders on Language Models
https://jbloomaus.github.io/SAELens/
MIT License
193 stars 67 forks

feat: add w_dec_norm folding #167

Closed jbloomAus closed 1 month ago

jbloomAus commented 1 month ago

Description

Adds a simple SAE method that folds the norms of the W_dec feature vectors of Anthropic-style SAEs into the encoder, so that every W_dec feature has unit norm. See the Anthropic update here: https://transformer-circuits.pub/2024/april-update/index.html#training-saes

[Screenshot: 2024-05-29 at 3:24:26 PM]

I have tested that the feature activations and the SAE output are as expected. We should possibly make this the default when loading pretrained SAEs, so that feature activations are conceptually what you expect them to be. I may submit a PR soon that makes this the case.
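The folding described above, and the kind of check mentioned here, can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the code in this PR: the helper names (`encode`, `decode`, `fold_w_dec_norm`), the weight layout (`W_enc` of shape `(d_in, d_sae)`, `W_dec` of shape `(d_sae, d_in)`), and the plain ReLU encoder are all assumed for the example.

```python
import numpy as np


def encode(x, W_enc, b_enc):
    """ReLU encoder (assumed form): feature activations for a batch of inputs."""
    return np.maximum(x @ W_enc + b_enc, 0.0)


def decode(acts, W_dec, b_dec):
    """Linear decoder (assumed form): reconstruct inputs from activations."""
    return acts @ W_dec + b_dec


def fold_w_dec_norm(W_enc, b_enc, W_dec):
    """Hypothetical helper: move each decoder row's L2 norm into the encoder.

    Returns rescaled copies of the weights plus the norms that were folded.
    """
    norms = np.linalg.norm(W_dec, axis=1)   # one norm per feature, shape (d_sae,)
    W_dec_folded = W_dec / norms[:, None]   # every decoder row now has unit norm
    W_enc_folded = W_enc * norms[None, :]   # scale the matching encoder column
    b_enc_folded = b_enc * norms            # scale the matching encoder bias
    return W_enc_folded, b_enc_folded, W_dec_folded, norms


# Random weights stand in for a trained SAE (illustrative sizes only).
rng = np.random.default_rng(0)
d_in, d_sae, batch = 8, 32, 4
W_enc = rng.normal(size=(d_in, d_sae))
b_enc = rng.normal(size=d_sae)
W_dec = rng.normal(size=(d_sae, d_in))
b_dec = rng.normal(size=d_in)
x = rng.normal(size=(batch, d_in))

W_enc_f, b_enc_f, W_dec_f, norms = fold_w_dec_norm(W_enc, b_enc, W_dec)
acts = encode(x, W_enc, b_enc)
acts_f = encode(x, W_enc_f, b_enc_f)
out = decode(acts, W_dec, b_dec)
out_f = decode(acts_f, W_dec_f, b_dec)
```

Because the norms are positive and ReLU is positively homogeneous, the folded activations equal the original activations scaled by the decoder norms, and the two scalings cancel in the decoder, so the reconstruction is unchanged.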

codecov[bot] commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 56.35%. Comparing base (0550ae3) to head (3d02279). Report is 1 commit behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #167      +/-   ##
==========================================
+ Coverage   56.25%   56.35%   +0.10%
==========================================
  Files          25       25
  Lines        2597     2603       +6
  Branches      439      440       +1
==========================================
+ Hits         1461     1467       +6
  Misses       1061     1061
  Partials       75       75
```
