likenneth / honest_llama

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
MIT License

ask about the insight behind ασθ #37

Closed NieSYsc20 closed 1 month ago

NieSYsc20 commented 1 month ago

Hi, great work!

In the paper, ασθ is added to the Attn(·) output, and this is explained as: "This is equivalent to shifting activations along the truthful directions for α times the standard deviation."

I am confused about the insight behind this design. What is the specific meaning of "α times the standard deviation" in this paper? Why can the activation vector be calculated in this way? Could you please provide a more detailed explanation?

Thanks!

likenneth commented 1 month ago

Hello,

The vector itself (the direction θ) is determined by the probing process. Alpha and sigma (both scalars) only control the strength of the intervention: alpha is a hyperparameter, and sigma is the standard deviation of the activations along the truthful direction.

The intuition here is that the feature space is anisotropic, so different directions require different intervention strengths. Scaling by sigma is a preliminary approach I employed to address this issue.
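
A minimal sketch of the scaling described above, assuming NumPy and synthetic two-dimensional "activations"; it is not the repository's code. The names `direction_std` and `intervene`, the toy data, and the value of `alpha` are all hypothetical. It shows that sigma is measured per direction, so the same alpha yields a large absolute shift along a high-variance direction and a small one along a low-variance direction.

```python
import numpy as np

def direction_std(acts, theta):
    """Standard deviation of activations projected onto the unit vector theta."""
    theta = theta / np.linalg.norm(theta)
    return (acts @ theta).std()

def intervene(head_out, theta, sigma, alpha):
    """Shift a head output by alpha * sigma along the unit direction theta."""
    theta = theta / np.linalg.norm(theta)
    return head_out + alpha * sigma * theta

rng = np.random.default_rng(0)
# Toy anisotropic activations: std of about 5.0 on axis 0 and about 0.1 on axis 1.
acts = rng.normal(size=(10000, 2)) * np.array([5.0, 0.1])

theta_wide = np.array([1.0, 0.0])    # stand-in for a high-variance direction
theta_narrow = np.array([0.0, 1.0])  # stand-in for a low-variance direction

sigma_wide = direction_std(acts, theta_wide)
sigma_narrow = direction_std(acts, theta_narrow)

alpha = 15.0  # illustrative strength only; in practice this is tuned
x = acts[0]
shifted_wide = intervene(x, theta_wide, sigma_wide, alpha)
shifted_narrow = intervene(x, theta_narrow, sigma_narrow, alpha)

# Absolute step sizes differ, but each equals alpha standard deviations
# of the activations along its own direction.
step_wide = np.linalg.norm(shifted_wide - x)
step_narrow = np.linalg.norm(shifted_narrow - x)
print(step_wide / sigma_wide, step_narrow / sigma_narrow)
```

Without the sigma factor, a fixed step size would be negligible along the wide direction and far out of distribution along the narrow one, which is the anisotropy issue mentioned above.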