Hello, could you explain what metric you are using to compare answers? The reported numbers use an indicator of whether the log prob of a sampled correct answer is greater than the log prob of the incorrect answer. This may require carefully selecting the right sequence index (right after the question) at which to begin calculating the log prob.
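For illustration, here is a minimal sketch of that kind of scoring with a HuggingFace-style causal LM; the model checkpoint and the answer_log_prob helper are assumptions for illustration, not our exact evaluation code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch only: the checkpoint and helper name are assumptions.
tokenizer = AutoTokenizer.from_pretrained('huggyllama/llama-7b')
model = AutoModelForCausalLM.from_pretrained('huggyllama/llama-7b')
model.eval()

def answer_log_prob(prompt, answer):
    """Total log prob of the answer tokens, conditioned on the prompt.

    The key detail is starting the sum at the first token after the prompt
    (i.e. right after the question), so the prompt itself is not scored.
    Tokenizing prompt and prompt + answer separately can shift token
    boundaries slightly, which is one thing to check when replicating.
    """
    prompt_ids = tokenizer(prompt, return_tensors='pt').input_ids
    full_ids = tokenizer(prompt + answer, return_tensors='pt').input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    for t in range(prompt_ids.shape[1], full_ids.shape[1]):
        # logits at position t - 1 predict the token at position t
        total += log_probs[0, t - 1, full_ids[0, t]].item()
    return total

# Per-example indicator:
# hit = answer_log_prob(prompt, correct_answer) > answer_log_prob(prompt, false_answer)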
Thank you for your response. To clarify, I am also using the indicator of whether the log prob of the sampled correct answer is greater than the log prob of the incorrect answer, similar to the MC1 metric used in TruthfulQA.
However, I believe there might be a difference in how we are sampling the correct answers. In my approach, I used the 'value' field from the 'answer' column in the dataset you provided (iti_trivia_qa_val) as the correct answers. Could you please guide me on the appropriate way to sample correct answers so that I can replicate the results presented in your paper?
Could you explain how you are currently calculating MC1 and MC2? For MC1, for example, there should be only one correct answer compared against multiple incorrect answers, but for TriviaQA there is just one provided incorrect answer.
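For reference, the TruthfulQA-style definitions I have in mind are roughly the following; this is a sketch under the assumption that each candidate answer has already been given a total log prob score (e.g. by a helper like answer_log_prob above), and for TriviaQA the incorrect list would contain just the single false_answer:

import numpy as np

def mc1(best_answer_score, incorrect_scores):
    # MC1: 1 if the single reference correct answer out-scores every incorrect answer.
    return float(best_answer_score > max(incorrect_scores))

def mc2(correct_scores, incorrect_scores):
    # MC2: normalized probability mass assigned to the correct answers
    # (scores are log probs, so exponentiate before normalizing).
    probs = np.exp(np.array(correct_scores + incorrect_scores))
    return float(probs[:len(correct_scores)].sum() / probs.sum())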
Yes, we also sample a correct answer from ['answer']['aliases'] and compare its log prob to that of the incorrect answer. Perhaps the sampling is the issue? Note that we also use an instruction prompt with few-shot demonstrations, as suggested in the original LLaMA paper. Perhaps this may be helpful:
import numpy as np

prefix = 'Answer these questions:\n'
if num_few_shot > 0:
    # sample num_few_shot questions (without replacement) to use as demonstrations
    rand_idxs = np.random.choice(len(trivia_qa), num_few_shot, replace=False).tolist()
    for i in rand_idxs:
        # pick one of the answer aliases at random for each demonstration
        ans_idx = np.random.choice(len(trivia_qa[i]['answer']['aliases']), 1)[0]
        prefix += 'Q: ' + trivia_qa[i]['question'] + '?\nA: ' + trivia_qa[i]['answer']['aliases'][ans_idx] + '\n\n'
prompt = prefix + 'Q: ' + question + '?\nA: '
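Purely as an illustration of how that prompt might then be used for the comparison (reusing prefix and trivia_qa from the snippet above and the answer_log_prob sketch from earlier in the thread; this is not our exact evaluation loop):

import numpy as np

hits = []
for ex in trivia_qa:
    prompt = prefix + 'Q: ' + ex['question'] + '?\nA: '
    # sample one correct alias and compare against the provided false answer
    ans_idx = np.random.choice(len(ex['answer']['aliases']), 1)[0]
    correct = ex['answer']['aliases'][ans_idx]
    incorrect = ex['false_answer']
    hits.append(answer_log_prob(prompt, correct) > answer_log_prob(prompt, incorrect))
print('accuracy:', float(np.mean(hits)))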
How many shots did you use for the few-shot demonstrations for TriviaQA and Natural Questions? Thanks in advance.
Both were zero-shot.
I am currently attempting to replicate the results presented in your paper using the LLaMA-7B model on iti_trivia_qa_val. However, I am encountering a significant discrepancy between my replicated results and those reported in your paper.
In my approach, I used the 'value' from the 'answer' field in the dataset as the Best Answer, the contents of 'aliases' as the Correct Answers, and 'false_answer' as the Incorrect Answers. Based on these, I calculated the following metrics:
Could you please provide some insight into the specific metric calculation methods used in your paper? I am keen to understand if there are any differences in our approaches that could account for this discrepancy. Additionally, any guidance on accurately replicating your results would be highly appreciated.
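In case it helps pinpoint the difference, here is roughly how I am mapping the fields and scoring each example; it reuses the illustrative answer_log_prob, mc1, and mc2 helpers sketched earlier in this thread, and the field names simply follow the iti_trivia_qa_val columns described above, so please treat it as a sketch rather than my exact code:

def build_mc_example(row):
    # Field mapping, following the columns described above.
    return {
        'best_answer': row['answer']['value'],        # Best Answer
        'correct_answers': row['answer']['aliases'],  # Correct Answers
        'incorrect_answers': [row['false_answer']],   # Incorrect Answers (only one is provided)
    }

def score_example(row, prompt):
    ex = build_mc_example(row)
    best = answer_log_prob(prompt, ex['best_answer'])
    correct = [answer_log_prob(prompt, a) for a in ex['correct_answers']]
    incorrect = [answer_log_prob(prompt, a) for a in ex['incorrect_answers']]
    return mc1(best, incorrect), mc2(correct, incorrect)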