voidism / DoLa

Official implementation for the paper "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models"
https://arxiv.org/abs/2309.03883

concerns about your experiments on TruthfulQA-MC #7

Closed 0-KaiKai-0 closed 8 months ago

0-KaiKai-0 commented 12 months ago

I am very excited about your fantastic work, but I also have a small concern about your experiments.

Following your instructions, I can obtain $(MC1, MC2, MC3) = (32.07, 63.78, 32.06)$ for TruthfulQA-MC, which is almost the same as the results you reported in your paper. However, I notice that you set post_softmax=False during evaluation. This conflicts with Equation (4) in your paper.

[screenshot of Eq. (4) from the paper, which applies a final softmax to the contrasted scores]

I have tried setting post_softmax=True, and the scores drop to $(31.95, 52.21, 28.15)$, which is still better than vanilla decoding.

I am concerned that it is not justified to use the logits directly, without a softmax to convert them into probabilities. I would appreciate any insight that could explain this discrepancy.

https://github.com/voidism/DoLa/blob/dc88907406f9744f748f3c779f2353efd5bdc824/tfqa_mc_eval.py#L302C318-L302C336
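
For context, here is a minimal sketch of what the flag toggles around the linked line (variable names are illustrative, not the exact code in tfqa_mc_eval.py):

```python
import torch

# Illustrative stand-ins for per-token logits from two layers
# (shape: [seq_len, vocab_size]); not the exact variables in the repo.
final_layer_logits = torch.randn(5, 32000)      # mature (final) layer
premature_layer_logits = torch.randn(5, 32000)  # contrasted early layer
post_softmax = False                            # the flag in question

final_logprobs = final_layer_logits.log_softmax(dim=-1)
base_logprobs = premature_layer_logits.log_softmax(dim=-1)
diff_logits = final_logprobs - base_logprobs    # log(q_N(x) / q_M(x))

if post_softmax:
    # Eq. (4): renormalize the contrasted scores into log-probabilities.
    diff_logits = diff_logits.log_softmax(dim=-1)
# With post_softmax=False, the raw log-ratios are used directly as scores.
```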

voidism commented 8 months ago

Hi,

Sorry for the confusion. Your observation is correct: setting post_softmax=False can slightly improve performance on TruthfulQA-MC. We reported the improved score but forgot to describe this detail in the paper. We will soon upload a new version of the arXiv paper that includes this observation and reports scores for both post_softmax=True and post_softmax=False.

It makes more sense to set post_softmax=True, converting the logits into probabilities and strictly following the equation. The improvement from post_softmax=False is hard for me to explain, as it is purely an empirical result. It may be related to the data distribution of TruthfulQA-MC itself.
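
For what it's worth, the final softmax is a monotonic per-position shift, so under greedy decoding it cannot change which token is picked at each step; it only changes summed sequence scores, which is exactly what the TruthfulQA-MC evaluation compares across answer choices. A toy illustration (not the repo's code):

```python
import torch

# diff_logits for 2 positions over a 3-token vocabulary (made-up numbers)
d = torch.tensor([[2.0, 0.5, -1.0],
                  [0.3, 0.1,  0.0]])
normed = d.log_softmax(dim=-1)  # subtracts a per-position constant (logsumexp)

# Per-position argmax is unchanged, so greedy generation is unaffected:
print(torch.equal(d.argmax(-1), normed.argmax(-1)))  # True

# But summed sequence scores shift by position-dependent constants,
# which differ between answer candidates in MC scoring:
print(d[0, 0] + d[1, 0])            # unnormalized "sequence score"
print(normed[0, 0] + normed[1, 0])  # proper log-probability of the same path
```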

enkeejunior1 commented 4 months ago

I also truly enjoy your work, but I have some concerns about this part.

The most worrisome aspect is whether the comparison can be considered fair without applying softmax. The TruthfulQA-MC evaluation scores an answer by its probability, computed from the logit values, but can we say that diff_logits without the softmax still carries log-probability information? From an engineering perspective, if we want to interpret diff_logits as log probabilities without applying softmax, we should also avoid the softmax in other generation tasks.
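
To make the normalization point concrete: without the final softmax, exp(diff_logits) does not sum to one over the vocabulary, so summing these scores over an answer's tokens does not yield a log-probability. A quick check (illustrative numbers only):

```python
import torch

diff = torch.tensor([1.2, -0.3, 0.5])  # toy contrasted scores over 3 tokens
print(torch.logsumexp(diff, dim=-1))   # != 0: exp(diff) is not a distribution
print(torch.logsumexp(diff.log_softmax(dim=-1), dim=-1))  # ~0: a true log-distribution
```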

I apologize in advance if this comes across as offensive. If the post-softmax were applied, the striking performance improvement on TruthfulQA-MC would disappear, but I still think this is a very good, solid paper.