vgel / repeng

A library for making RepE control vectors
https://vgel.me/posts/representation-engineering/
MIT License

Will there be support for models with custom architecture (not only mistral or gpt based)? #25

Open Nishant-kirito opened 4 months ago

vgel commented 4 months ago

Is there a specific model you'd be interested in? It's not very hard to add support for new models, so if there's a relatively popular one you're interested in having supported, I can take a look. If you have a truly custom model (that is still decoder-only, and not a MoE), you can patch it to expose a Mistral-like interface (e.g., by making a wrapper class that exposes the appropriate config / layers properties) and that should just work; the per-model support only exists to handle HF's lack of consistency in the model interface.
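
For illustration, here's a minimal sketch of the kind of wrapper described above. It assumes a hypothetical custom model that keeps its decoder blocks at `custom_model.decoder.blocks`, and that a "Mistral-like interface" means exposing `config` plus a `.model.layers` list of blocks; check repeng's source for the exact attributes it actually reads.

```python
# Hypothetical sketch, not repeng's API: adapt a custom decoder-only model
# so it presents a Mistral-like surface (config + .model.layers).
import torch

class _InnerWrapper(torch.nn.Module):
    """Mimics MistralModel: exposes `.layers`, the list of decoder blocks."""
    def __init__(self, custom_model):
        super().__init__()
        self.custom_model = custom_model
        # Point this at wherever your architecture stores its decoder blocks
        # (the path below is made up for illustration).
        self.layers = custom_model.decoder.blocks

    def forward(self, *args, **kwargs):
        return self.custom_model(*args, **kwargs)

class MistralLikeWrapper(torch.nn.Module):
    """Mimics MistralForCausalLM: exposes `.config` and `.model`."""
    def __init__(self, custom_model):
        super().__init__()
        # The config should carry fields like num_hidden_layers / hidden_size,
        # as HF configs do.
        self.config = custom_model.config
        self.model = _InnerWrapper(custom_model)

    def forward(self, *args, **kwargs):
        return self.model(*args, **kwargs)
```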

NanoCode012 commented 3 months ago

Hey @vgel, would it be possible to share the method for adding new models?

I’m also interested in MoE models, which I saw you explicitly mention. What challenges are there to supporting them (for example, Mixtral)? I’ve tried running the current notebooks with Mixtral and unfortunately don't see much difference between the responses.

vgel commented 3 months ago

@NanoCode012 the main issue with MoE models is that only a subset of experts is active for a given forward pass (by design). repeng runs a bunch of forward passes (roughly one per batch of examples), so we end up mixing activations from different experts and running PCA over them as if they all came from the same source.

The correct thing to do would be to move the extraction/control point down from after each transformer block to after each expert, and then collect vectors by layer and expert, not just by layer, and likewise apply per-expert as well. Training datasets (and training time) would probably need to be larger to accommodate this. It would be nice to have, but I've been pretty busy so haven't been able to make it a priority.
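
As a rough illustration of "collect vectors by layer and expert" (this is not repeng code; the module paths follow the Hugging Face transformers Mixtral implementation and may change), one could hook every expert MLP and key the collected activations by `(layer, expert)`:

```python
# Sketch: gather per-(layer, expert) activations from a HF Mixtral model
# by hooking each expert MLP. Only tokens routed to an expert reach it,
# so the number of rows collected differs between experts.
from collections import defaultdict
import torch

def collect_per_expert(model, input_ids):
    acts = defaultdict(list)  # (layer_idx, expert_idx) -> list of tensors
    hooks = []
    for li, layer in enumerate(model.model.layers):
        for ei, expert in enumerate(layer.block_sparse_moe.experts):
            def hook(module, inputs, output, key=(li, ei)):
                acts[key].append(output.detach().float().cpu())
            hooks.append(expert.register_forward_hook(hook))
    try:
        with torch.no_grad():
            model(input_ids)
    finally:
        for h in hooks:
            h.remove()
    return acts
```

Applying control per expert would then be the reverse: adding each `(layer, expert)` vector inside the corresponding expert's forward pass rather than after the whole transformer block.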

ycros commented 3 months ago

So how do you add support for more models? I tried a simple test: I took the experiments notebook and replaced mistralai/Mistral-7B-Instruct-v0.1 with mistralai/Mistral-7B-Instruct-v0.2, and running with 0.2 gives me garbage outputs. I would have thought they'd be similar enough that it should just work; what am I missing?

[screenshot: garbled output from Mistral-7B-Instruct-v0.2]

ndavidson19 commented 3 months ago

@ycros Mistral-v0.2 uses a RoPE theta value of 1e6 and removes sliding-window attention; this should be easy to fix within the model config parameters.
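
As a sketch of that config-level fix (assuming a recent transformers version; whether this alone resolves the garbage output isn't confirmed in this thread), the two values can be forced when loading:

```python
# Force the v0.2 config values mentioned above before loading the weights.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

name = "mistralai/Mistral-7B-Instruct-v0.2"
config = AutoConfig.from_pretrained(name)
config.rope_theta = 1e6       # v0.2 uses 1e6 (v0.1 used 1e4)
config.sliding_window = None  # v0.2 drops sliding-window attention

tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, config=config)
```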

@vgel I'm interested in getting this working for the Phi-2 architecture. I might take a stab at it, as this seems like an extremely powerful technique for anti-jailbreaking.

davniko commented 1 month ago

I've been playing around with this lib and got it to work with MoEs... Training the vectors is quite slow compared to dense models when training on the full all_truncated_outputs.json, and the code probably needs some refactoring/optimization.

This is using a Mixtral model (the dolphin finetune): [screenshot of example outputs]

MoEs seem to handle larger coefficients better than other models in my short early testing. Curiously, I could also get some decent results when training the happy vector with a dataset of just 18 example pairs: [screenshot of example outputs]

I haven't tested it extensively yet so I don't know how robust or reliable it is tbh, but I'll push the code and an example notebook up on my fork after cleaning it up a bit, for those who might be interested (and in case @vgel is not already working on this behind the scenes).