jbloomAus / SAELens

Training Sparse Autoencoders on Language Models
https://jbloomaus.github.io/SAELens/
MIT License
481 stars 127 forks

feat: Add linear probe trainer #356

Open tom-pollak opened 3 weeks ago

tom-pollak commented 3 weeks ago

Description

@chanind @jbloomAus

Created a linear probe trainer that takes a Hugging Face dataset of activations (created as in #321) and trains a linear probe on it, provided we have labels for the activations.

By default it logs all your training stats to a wandb run, similar to training the SAE.

[Image: Screenshot 2024-10-31 at 10 23 39]

(The training run is clearly very bad.)

So far I've just tried to keep the code simple; I don't know whether there's any interest in a probe trainer like this.
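For readers unfamiliar with the idea, here is a minimal sketch of what training a linear probe on labelled activations involves. It is not the code from this PR: it uses only the Python standard library, replaces the Hugging Face activation dataset with synthetic vectors, and omits wandb logging. The helper names (`make_synthetic_activations`, `train_linear_probe`, `accuracy`) and all hyperparameters are hypothetical, chosen for illustration only.

```python
import math
import random


def make_synthetic_activations(n, dim, seed=0):
    """Hypothetical stand-in for cached model activations with binary labels.

    Labels are determined by the sign of a projection onto a random hidden
    direction, so a linear probe should be able to recover them.
    """
    rng = random.Random(seed)
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    xs, ys = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(dim)]
        score = sum(d * v for d, v in zip(direction, x))
        xs.append(x)
        ys.append(1 if score > 0 else 0)
    return xs, ys


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(xs, ys, lr=0.1, epochs=50, seed=0):
    """Fit a logistic-regression probe (weights w, bias b) with plain SGD."""
    rng = random.Random(seed)
    w = [0.0] * len(xs[0])
    b = 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            logit = sum(wj * xj for wj, xj in zip(w, xs[i])) + b
            grad = sigmoid(logit) - ys[i]  # d(cross-entropy)/d(logit)
            w = [wj - lr * grad * xj for wj, xj in zip(w, xs[i])]
            b -= lr * grad
    return w, b


def accuracy(w, b, xs, ys):
    """Fraction of examples where the probe's thresholded logit matches the label."""
    correct = 0
    for x, y in zip(xs, ys):
        pred = 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(xs)


# Train on 300 synthetic activations and evaluate on 100 held-out ones.
xs, ys = make_synthetic_activations(n=400, dim=8)
train_x, train_y = xs[:300], ys[:300]
test_x, test_y = xs[300:], ys[300:]
w, b = train_linear_probe(train_x, train_y)
test_acc = accuracy(w, b, test_x, test_y)
print(f"held-out accuracy: {test_acc:.2f}")
```

In a real setup the synthetic data would be replaced by rows of stored activations and their labels, and the per-step loss and accuracy would be the kind of stats one would log to wandb.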


Checklist:

You have tested formatting, typing and unit tests (acceptance tests not currently in use)

chanind commented 3 weeks ago

I would personally view this as out-of-scope of this library, since there's nothing SAE-specific about training a linear probe. Curious to hear others' thoughts though!

hijohnnylin commented 3 weeks ago

> I would personally view this as out-of-scope of this library, since there's nothing SAE-specific about training a linear probe. Curious to hear others' thoughts though!

FWIW, I'm a +1 to including this in SAELens and/or some similarly accessible repo. Possibly the scope of SAELens should be widened. I would prefer not to have separate libraries for TranscoderLens, CrosscoderLens, ProbeLens, etc. Though I defer to you, @curt-tigges, etc.

curt-tigges commented 3 weeks ago

We've had some discussion internally about creating and maintaining a probe-specific library, which I think would be a better fit for this than SAELens. One of the main benefits of having a library is sharing common code, but probing is rather independent and shares essentially nothing with the rest of the code, so there's no specific reason to have it in SAELens.

Proposal: I'm inclined to just go ahead and create a sketch for an initial probing library and ask @tom-pollak to resubmit this code to that. @jbloomAus any thoughts?