likenneth / honest_llama

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
MIT License

Issues related to reproducing the results of the paper #35

Closed wytbwytb closed 2 months ago

wytbwytb commented 3 months ago

Hello! I have been working on reproducing the results from your paper, but I encountered some inconsistencies and would greatly appreciate your guidance.

Firstly, I reproduced the Alpaca-7B results on TruthfulQA using the provided repository and obtained the following results (screenshot attached). I noticed a significant decrease in the True and True*Info metrics after the intervention. Could this be due to an issue with my parameter settings? Additionally, since OpenAI has discontinued the fine-tunable GPT-3 Curie model, I have been using the Babbage model instead. Might this be contributing to the unexpected results?

Secondly, I reproduced the generalization results on Natural Questions using the directions calculated from TruthfulQA, and found that the MC1 performance deteriorated (screenshot attached). Could this also be related to my parameter settings?

I would be extremely grateful for any insights or suggestions you could provide.

Thank you very much for your time and assistance.

Looking forward to your response!

likenneth commented 3 months ago

As seen in the "random direction" row of Table 3 in the ITI paper, it is actually non-trivial to find a direction that degrades the truthfulness of an LLM; usually an arbitrary direction has no effect on the model at all. Given the various changes in dataset, evaluation, and model on your side, I can't predict exactly what is going wrong, but I would suggest one thing: have you tried intervening in the opposite direction?
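To make that suggestion concrete, here is a minimal sketch (the function name and shapes are illustrative, not the repo's actual code): ITI adds a shift of alpha * sigma * theta to each selected head's activation, so trying the opposite direction just means negating that shift.

```python
import torch

def apply_iti_shift(head_activation: torch.Tensor,
                    direction: torch.Tensor,
                    sigma: float,
                    alpha: float = 15.0,
                    flip: bool = False) -> torch.Tensor:
    """Add the ITI shift alpha * sigma * theta to one head's activation.

    Set flip=True to test the opposite direction: the shift is simply negated.
    (Illustrative sketch; head_activation and direction share shape (head_dim,).)
    """
    sign = -1.0 if flip else 1.0
    return head_activation + sign * alpha * sigma * direction
```

If the flipped intervention clearly hurts truthfulness while the original direction helps (or at least does not hurt), that is a sign the found directions are meaningful and the problem lies elsewhere in the evaluation pipeline.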

wytbwytb commented 3 months ago

Thanks for your response! I will try it later.

Additionally, I have a question regarding the generalizability experiments. Could you please clarify if the experiments on Natural Questions directly use the probes trained on TruthfulQA for intervention, or if there are any other manipulations involved? Also, is it necessary to calculate the variance on the Natural Questions data?

Looking forward to your response!

likenneth commented 3 months ago

No additional changes were made. The std is inherent to the model and the found directions, so it is not re-calculated on downstream datasets.
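For reference, a minimal sketch of where that std comes from (names and shapes are illustrative, not the repo's code): it is the standard deviation of a head's activations projected onto the found direction, estimated once on the probing set (TruthfulQA activations) and then reused unchanged when intervening on other datasets such as Natural Questions.

```python
import numpy as np

def direction_std(head_activations: np.ndarray, direction: np.ndarray) -> float:
    """head_activations: (n_samples, head_dim) activations collected on the probing set.
    direction: (head_dim,) truthful direction found for this head.

    Returns the std of the activations projected onto the direction; this scalar
    scales the intervention strength alpha and is reused as-is downstream.
    """
    unit = direction / np.linalg.norm(direction)
    projections = head_activations @ unit  # shape (n_samples,)
    return float(projections.std())
```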