stanfordnlp / pyvene

Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions
http://pyvene.ai
Apache License 2.0

[P1] Tutorial of Inference-time Intervention #68

Closed: frankaging closed this issue 8 months ago

frankaging commented 8 months ago

Description:

Steering model behavior by intervening on activations at inference time is a good application of this library and fits its ultimate goal well. Ideally, people should be able to easily share their steering mount points (where in the model to intervene) together with the vectors to inject.

Original GitHub: https://github.com/likenneth/honest_llama
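
For readers new to the idea, here is a minimal sketch of inference-time activation steering as described above. It uses a plain PyTorch forward hook rather than pyvene's own intervention API, and everything in it is an assumption for illustration: `block` is a toy stand-in for a transformer sub-module, and `steering_vector` is a made-up direction, not one taken from honest_llama.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for one sub-module of a model (hypothetical, not a LLaMA layer).
block = nn.Linear(8, 8)

# Hypothetical steering vector; in practice this would be a shared, learned direction.
steering_vector = torch.full((8,), 0.5)


def add_steering(module, inputs, output):
    # A forward hook that returns a value replaces the module's output,
    # so this adds the steering vector to every activation row.
    return output + steering_vector


x = torch.randn(2, 8)
baseline = block(x)

# Mount the intervention, run the module, then unmount it.
handle = block.register_forward_hook(add_steering)
steered = block(x)
handle.remove()

# With the hook removed, the module behaves as before.
restored = block(x)
```

The "mount point" here is simply the module the hook is registered on; sharing a steering setup amounts to sharing that location plus the vector.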

frankaging commented 8 months ago

Update: it's hard to find the raw activation additions, so I will probably compute a model weight diff by loading https://huggingface.co/likenneth/honest_llama2_chat_7B and the original model to get the per-head diff, and then apply it.
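
A rough sketch of what such a weight diff could look like, assuming the edit shows up as changed or newly added tensors in the checkpoint's state dict. The helper name `state_dict_diff` is hypothetical, and the two tiny `nn.Linear` modules stand in for the real checkpoints, which would be loaded separately and compared through their `state_dict()`.

```python
import torch
import torch.nn as nn


def state_dict_diff(base, tuned, atol=0.0):
    """Return {name: tuned - base} for tensors that changed, plus tensors new in `tuned`."""
    diff = {}
    for name, tuned_tensor in tuned.items():
        if name not in base:
            # Tensor exists only in the tuned model (for example, an added bias).
            diff[name] = tuned_tensor.clone()
            continue
        delta = tuned_tensor - base[name]
        if delta.numel() > 0 and delta.abs().max().item() > atol:
            diff[name] = delta
    return diff


# Toy stand-ins: the "tuned" module shares the base weight but has an extra bias.
torch.manual_seed(0)
base_model = nn.Linear(4, 4, bias=False)
tuned_model = nn.Linear(4, 4, bias=True)
with torch.no_grad():
    tuned_model.weight.copy_(base_model.weight)
    tuned_model.bias.fill_(0.1)

diff = state_dict_diff(base_model.state_dict(), tuned_model.state_dict())
changed_names = sorted(diff)
```

In this toy case only the added bias appears in `diff`; whether the real checkpoints differ in the same way is something the diff itself would reveal.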

The original implementation uses BauKit to do the intervention. I am hoping to show that we can save the weight diff along with the intervention config, so people can apply it directly as an activation diff.
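
One possible shape for a shareable artifact, sketched with only the standard library. This is not pyvene's config format: `SteeringRecord`, its fields, the component label, and the JSON layout are all hypothetical, chosen only to show a diff vector stored next to the location it should be applied at.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class SteeringRecord:
    """Hypothetical record: where to intervene and what vector to add."""
    layer: int        # index of the layer to intervene on
    component: str    # free-form label for the mount point within that layer
    vector: list      # the diff to add to the activation, as plain floats


def save_records(path, records):
    # Store all records in one JSON file so location and vector travel together.
    with open(path, "w") as f:
        json.dump([asdict(r) for r in records], f)


def load_records(path):
    with open(path) as f:
        return [SteeringRecord(**d) for d in json.load(f)]


# Made-up example values, for illustration only.
records = [
    SteeringRecord(layer=3, component="attention_output", vector=[0.1, -0.2, 0.3]),
    SteeringRecord(layer=7, component="attention_output", vector=[0.0, 0.5, -0.5]),
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "steering.json")
    save_records(path, records)
    loaded = load_records(path)
```

JSON keeps the sketch dependency-free; for full-size tensors a binary tensor format would be the more practical choice.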