Hello, could you explain what metric you are using to compare answers? The reported numbers use an indicator of whether the log prob of a sampled correct answer is greater than the log prob of the incorrect answer. This may require carefully selecting the right sequence index (right after the question) at which to begin calculating the log prob.
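For illustration, here is a minimal sketch of that kind of scoring with a HuggingFace-style causal LM; the model checkpoint and the answer_log_prob helper are assumptions for illustration, not our exact evaluation code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch only: the checkpoint and helper name are assumptions.
tokenizer = AutoTokenizer.from_pretrained('huggyllama/llama-7b')
model = AutoModelForCausalLM.from_pretrained('huggyllama/llama-7b')
model.eval()

def answer_log_prob(prompt, answer):
    """Total log prob of the answer tokens, conditioned on the prompt.

    The key detail is starting the sum at the first token after the prompt
    (i.e. right after the question), so the prompt itself is not scored.
    Tokenizing prompt and prompt + answer separately can shift token
    boundaries slightly, which is one thing to check when replicating.
    """
    prompt_ids = tokenizer(prompt, return_tensors='pt').input_ids
    full_ids = tokenizer(prompt + answer, return_tensors='pt').input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    for t in range(prompt_ids.shape[1], full_ids.shape[1]):
        # logits at position t - 1 predict the token at position t
        total += log_probs[0, t - 1, full_ids[0, t]].item()
    return total

# Per-example indicator:
# hit = answer_log_prob(prompt, correct_answer) > answer_log_prob(prompt, false_answer)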
Thank you for your response. To clarify, I am also using the indicator of whether the log prob of the sampled correct answer is greater than the log prob of the incorrect answer, similar to the MC1 metric used in TruthfulQA.
However, I believe there might be a difference in how we are sampling the correct answers. In my approach, I used the 'value' field from the 'answer' column in the dataset you provided (iti_trivia_qa_val) as the correct answers. Could you please guide me on the appropriate way to sample correct answers so that I can replicate the results presented in your paper?
Could you explain how you are currently calculating MC1 and MC2? For MC1, for example, there should be only one correct answer compared against multiple incorrect answers, but for TriviaQA there is just one provided incorrect answer.
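For reference, the TruthfulQA-style definitions I have in mind are roughly the following; this is a sketch under the assumption that each candidate answer has already been given a total log prob score (e.g. by a helper like answer_log_prob above), and for TriviaQA the incorrect list would contain just the single false_answer:

import numpy as np

def mc1(best_answer_score, incorrect_scores):
    # MC1: 1 if the single reference correct answer out-scores every incorrect answer.
    return float(best_answer_score > max(incorrect_scores))

def mc2(correct_scores, incorrect_scores):
    # MC2: normalized probability mass assigned to the correct answers
    # (scores are log probs, so exponentiate before normalizing).
    probs = np.exp(np.array(correct_scores + incorrect_scores))
    return float(probs[:len(correct_scores)].sum() / probs.sum())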
Yes, we also sample a correct answer from ['answer']['aliases'] and compare its log prob to that of the incorrect answer. Perhaps the sampling is the issue? Note that we also use an instruction prompt with few-shot demonstrations, as suggested in the original LLaMA paper. Perhaps this may be helpful:
import numpy as np

prefix = 'Answer these questions:\n'
if num_few_shot > 0:
    # sample num_few_shot questions (without replacement) to use as demonstrations
    rand_idxs = np.random.choice(len(trivia_qa), num_few_shot, replace=False).tolist()
    for i in rand_idxs:
        # pick one of the answer aliases at random for each demonstration
        ans_idx = np.random.choice(len(trivia_qa[i]['answer']['aliases']), 1)[0]
        prefix += 'Q: ' + trivia_qa[i]['question'] + '?\nA: ' + trivia_qa[i]['answer']['aliases'][ans_idx] + '\n\n'
prompt = prefix + 'Q: ' + question + '?\nA: '
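Purely as an illustration of how that prompt might then be used for the comparison (reusing prefix and trivia_qa from the snippet above and the answer_log_prob sketch from earlier in the thread; this is not our exact evaluation loop):

import numpy as np

hits = []
for ex in trivia_qa:
    prompt = prefix + 'Q: ' + ex['question'] + '?\nA: '
    # sample one correct alias and compare against the provided false answer
    ans_idx = np.random.choice(len(ex['answer']['aliases']), 1)[0]
    correct = ex['answer']['aliases'][ans_idx]
    incorrect = ex['false_answer']
    hits.append(answer_log_prob(prompt, correct) > answer_log_prob(prompt, incorrect))
print('accuracy:', float(np.mean(hits)))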
How many shots did you use for the few-shot demonstrations for TriviaQA and Natural Questions? Thanks in advance.
Both were zero-shot.
I am currently attempting to replicate the results presented in your paper using the LLaMA-7B model on iti_trivia_qa_val. However, I am encountering a significant discrepancy between my replicated results and those reported in your paper.
In my approach, I used the 'value' from the 'answer' field in the dataset as the Best Answer, the contents of 'aliases' as the Correct Answers, and 'false_answer' as the Incorrect Answers. Based on these, I calculated the following metrics:
Could you please provide some insight into the specific metric calculation methods used in your paper? I am keen to understand if there are any differences in our approaches that could account for this discrepancy. Additionally, any guidance on accurately replicating your results would be highly appreciated.
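In case it helps pinpoint the difference, here is roughly how I am mapping the fields and scoring each example; it reuses the illustrative answer_log_prob, mc1, and mc2 helpers sketched earlier in this thread, and the field names simply follow the iti_trivia_qa_val columns described above, so please treat it as a sketch rather than my exact code:

def build_mc_example(row):
    # Field mapping, following the columns described above.
    return {
        'best_answer': row['answer']['value'],        # Best Answer
        'correct_answers': row['answer']['aliases'],  # Correct Answers
        'incorrect_answers': [row['false_answer']],   # Incorrect Answers (only one is provided)
    }

def score_example(row, prompt):
    ex = build_mc_example(row)
    best = answer_log_prob(prompt, ex['best_answer'])
    correct = [answer_log_prob(prompt, a) for a in ex['correct_answers']]
    incorrect = [answer_log_prob(prompt, a) for a in ex['incorrect_answers']]
    return mc1(best, incorrect), mc2(correct, incorrect)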