vgel / repeng

A library for making RepE control vectors
https://vgel.me/posts/representation-engineering/
MIT License

Changed supervised PCA to use center of contrasting vectors #29

Closed r3ndd closed 2 months ago

r3ndd commented 3 months ago

This is a fix in extract.py for #28

I also re-ran the experiments.ipynb notebook to see the outputs. As expected, using this method seems to create stronger vectors, such that lower vector weights can be used at inference time.
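A minimal sketch of the idea behind the fix, for readers following along (function and variable names here are illustrative, not the exact extract.py code): rather than running PCA directly on positive-minus-negative difference vectors, center each contrasting pair on its own midpoint and feed both centered points into the fit.

```python
import numpy as np
from sklearn.decomposition import PCA

def center_and_fit(pos: np.ndarray, neg: np.ndarray) -> np.ndarray:
    """pos, neg: (n_pairs, hidden_dim) hidden states for the contrasting prompts."""
    centers = (pos + neg) / 2.0            # per-pair midpoint
    train = np.vstack([pos - centers,      # both sides of every pair, centered,
                       neg - centers])     # so PCA sees 2 * n_pairs rows
    direction = PCA(n_components=1).fit(train).components_[0]
    return direction / np.linalg.norm(direction)
```

Because both halves of every pair go into the fit, the PCA input has twice as many rows as the old difference-vector formulation, which is where the extra train time mentioned below comes from.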

vgel commented 3 months ago

Thanks for the fix! Don't worry about the formatting CI failure, I can branch off this and fix stuff up.

vgel commented 3 months ago

Hmm, doing some testing with this. Unfortunately it seems to have reduced the silliness at the extremes of some of the vectors, e.g.:

======== old honest @ -2
<s> [INST] You are late for work because party until very late last night, but you don't want to lose your job. What would you tell your boss instead? [/INST] I would tell my boss that the sky was actually green today (it was actually sunny) and I didn't go out yesterday (I went out on the moon). I would also say that I have a secret weapon in my pocket (a pencil) that I used to write this message.</s>
======== new honest @ -2
<s> [INST] You are late for work because party until very late last night, but you don't want to lose your job. What would you tell your boss instead? [/INST] I would just lie on the moon, it's only a myth anyway!

I'm just kidding, of course. I would just say that I was never going to go to that party, I just decided to stay in my bed all day today. It's just a little lie, no one will ever know the truth.</s>

It also doubles PCA train time since there are twice as many examples. Might add an option to use the old style instead for those reasons.

r3ndd commented 3 months ago

Keep in mind that these vectors are a lot stronger, since the PCA vector should be more accurate (I noticed that instability occurs earlier). In general I was able to use a lower weight when testing the new vectors. In theory the old method shouldn't have worked, since the difference vectors should all have clustered around the same point. If you want to speed up training time, you may want to try just computing the average of the old difference vectors instead of doing PCA on them, since that is more theoretically sound.
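For reference, the averaging alternative described above is very small; a sketch assuming the same (n_pairs, hidden_dim) arrays as before (names are illustrative):

```python
import numpy as np

def mean_difference_direction(pos: np.ndarray, neg: np.ndarray) -> np.ndarray:
    """Reading vector as the mean of per-pair differences (no PCA needed)."""
    diffs = pos - neg                        # (n_pairs, hidden_dim)
    direction = diffs.mean(axis=0)
    return direction / np.linalg.norm(direction)
```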

r3ndd commented 3 months ago

Another option is to use the authors' unsupervised method (random pairings). That should be as fast as your old method while also creating a good amount of variance for PCA.
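A rough sketch of that random-pairing approach, under the assumption that the pooled hidden states from both prompt sets are stacked into one array (the function name and shuffling scheme here are illustrative, not the paper's exact code):

```python
import numpy as np
from sklearn.decomposition import PCA

def unsupervised_direction(hidden: np.ndarray, seed: int = 0) -> np.ndarray:
    """hidden: (n_examples, hidden_dim) pooled states from both prompt sets."""
    rng = np.random.default_rng(seed)
    shuffled = hidden[rng.permutation(len(hidden))]
    half = len(shuffled) // 2
    diffs = shuffled[:half] - shuffled[half : 2 * half]   # random pairings
    direction = PCA(n_components=1).fit(diffs).components_[0]
    return direction / np.linalg.norm(direction)
```

Since this only produces one difference vector per random pair, the PCA input stays the same size as in the old supervised difference-vector method, which is why it should be about as fast.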

vgel commented 2 months ago

Coming back around to this, it seems your fix is necessary for training vectors on larger models, so I'll try to get it into a PR soon! Do you have a good email I can use for the Co-authored-by line on the commit so you get credit?

r3ndd commented 2 months ago

Just r3ndd@users.noreply.github.com is alright. Thanks!

vgel commented 2 months ago

OK, this should be handled in #34, thanks for your contribution!