likenneth / honest_llama

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
MIT License

Saving the model after shifting activations #12

Closed A-Raafat closed 1 year ago

A-Raafat commented 1 year ago

Hello, how do you get the results without ITI? Do you manually pass `interventions={}` to the `alt_tqa_evaluate` function?

I also have another question: how do you save the new model after shifting the activation directions? If I save the model inside the `with TraceDict(model, layers_to_intervene, edit_output=intervene) as ret:` block, will that produce a model with ITI applied?

likenneth commented 1 year ago

Hi there,

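(1) I set alpha to 0.

(2) ITI stands for inference-time intervention, which means the model weights are never changed. The shift is applied to the activations during the forward pass, so saving the model inside the TraceDict block would still save the original weights.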
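To illustrate both points, here is a minimal, dependency-free sketch. It is not the repository's code and does not use baukit's TraceDict; `ToyLayer`, `trace`, and `make_intervention` are hypothetical names invented for this example. It assumes the intervention adds `alpha * direction` to a layer's output, and shows that alpha = 0 reproduces the baseline and that the weights are the same before, during, and after the intervention.

```python
from contextlib import contextmanager


class ToyLayer:
    """Hypothetical stand-in for a model layer whose output gets edited."""

    def __init__(self, weights):
        self.weights = list(weights)
        self.edit_output = None  # optional function applied to the output

    def forward(self, x):
        out = [w * x for w in self.weights]
        if self.edit_output is not None:
            out = self.edit_output(out)
        return out


@contextmanager
def trace(layer, edit_output):
    """Attach an output edit for the duration of the block, then remove it."""
    layer.edit_output = edit_output
    try:
        yield layer
    finally:
        layer.edit_output = None


def make_intervention(direction, alpha):
    """Build an edit that shifts the output by alpha * direction."""

    def intervene(out):
        return [o + alpha * d for o, d in zip(out, direction)]

    return intervene


layer = ToyLayer([1.0, 2.0, 3.0])
direction = [0.5, -0.5, 1.0]

baseline = layer.forward(2.0)

# With a non-zero alpha the output is shifted, but the weights are untouched.
with trace(layer, make_intervention(direction, alpha=15.0)):
    shifted = layer.forward(2.0)
    weights_inside = list(layer.weights)

# With alpha = 0 the edit is a no-op, which gives the "without ITI" result.
with trace(layer, make_intervention(direction, alpha=0.0)):
    no_iti = layer.forward(2.0)

print(baseline, shifted, no_iti, weights_inside)
```

Because the edit lives in the hook rather than in the weights, anything serialized from `layer.weights` is identical with or without the intervention. Getting a standalone model with the shift included would require a separate step that changes the model's parameters, which this thread does not cover.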