likenneth / honest_llama

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
MIT License

How to calculate the gap between generation accuracy and probe accuracy, which is 40% mentioned in the paper? #28

Closed DLiquor closed 10 months ago

DLiquor commented 10 months ago

Hi, thanks for your work! I have tried ITI with Baichuan on my Chinese dataset. However, it does not improve truthfulness much. I want to know whether the gap you describe also exists in my scenario. You mentioned a gap of 40% between generation accuracy and probe accuracy. How are these two calculated? Does the generation accuracy mean the True% of the baseline? For the probe accuracy, is there a specific binary classifier trained for each attention head, i.e., 32×32 probes for LLaMA? Hope for your reply!

likenneth commented 10 months ago

Hello. The "generation accuracy" is measured on your specific truthfulness benchmark. I assume you are measuring with your own benchmark, since TruthfulQA doesn't cover Chinese? The "internal accuracy" comes from linear-probing every attention head and taking the maximum validation-set accuracy (4:1 train/validation split).
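For reference, the probing setup described above can be sketched as follows: one linear probe per attention head, with the "internal accuracy" taken as the best validation score over all heads. This is a minimal illustration, not the repo's actual code; the activation tensor shape, the `probe_accuracy` helper, and the synthetic data are assumptions for the demo.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(head_activations, labels, seed=0):
    """Fit one linear probe per attention head and return the max
    validation accuracy across heads (4:1 train/validation split).

    head_activations: array of shape (n_samples, n_layers, n_heads, head_dim)
    labels: binary truthfulness labels of shape (n_samples,)
    """
    n_samples, n_layers, n_heads, head_dim = head_activations.shape
    best = 0.0
    for layer in range(n_layers):
        for head in range(n_heads):
            X = head_activations[:, layer, head, :]
            X_tr, X_va, y_tr, y_va = train_test_split(
                X, labels, test_size=0.2, random_state=seed)
            clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
            best = max(best, clf.score(X_va, y_va))
    return best

# Synthetic demo: 200 samples, 2 layers x 2 heads, head_dim 16.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
acts = rng.normal(size=(200, 2, 2, 16))
# Make one head (layer 1, head 0) linearly informative of the label.
acts[:, 1, 0, 0] += 3.0 * (2 * y - 1)
print(round(probe_accuracy(acts, y), 2))
```

For a full LLaMA-7B-style model this loop would cover 32 layers × 32 heads (1024 probes), matching the 32×32 figure asked about above.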

DLiquor commented 10 months ago

Thanks for your reply!