meta-llama / llama

Inference code for Llama models

SQuAD evaluation #867

Open obhalerao97 opened 8 months ago

obhalerao97 commented 8 months ago

Hello, I'm working on evaluating llama-2-70b-chat on the SQuAD dataset, but the EM and F1 scores I get don't match the scores mentioned in the paper. I'm not sure what I'm doing differently. Could you clarify how many samples of the SQuAD dataset the model is evaluated on and what the system prompt looks like for this particular task?

Thank you

Sanchit-404 commented 8 months ago

Having the same issue with the 7B model: there is a significant difference from the reported scores. Is there any script to replicate the results?

obhalerao97 commented 8 months ago

Follow-up: I'm getting an exact match score of 68.4, while the paper reports 80.7. Could someone help explain this difference?

tangbinh commented 8 months ago

@obhalerao97 @Sanchit-404 We use a prompt similar to what was described in Ouyang et al. (2022) where the "zero-shot" prompt for a question includes preceding questions and answers from the same paragraph (Figure 20).

obhalerao97 commented 8 months ago

@tangbinh Thank you for your response. Please correct me if I'm wrong, but the examples shown look like a 1-shot prompt, since we give a question and an answer before asking the actual question.

Sanchit-404 commented 8 months ago

Follow-up: I'm getting an exact match score of 68.4, while the paper reports 80.7. Could someone help explain this difference?

Can you please share your script?

obhalerao97 commented 8 months ago

@tangbinh I also had a question about how you're calculating the exact match and F1 scores: since each question has multiple answers, are you taking the max EM and F1 over all the answers or averaging over all the answers?

tangbinh commented 8 months ago

@tangbinh Thank you for your response. Please correct me if I'm wrong, but the examples shown look like a 1-shot prompt, since we give a question and an answer before asking the actual question.

The paper uses the term "zero-shot" for this setting, so we simply followed that convention. I agree it's not precise, but "zero" here can be understood as the count of paragraph examples, not question-answer pairs. I think this setting is more relevant for dialog datasets such as QuAC, where information from preceding questions is required to answer the current one.

how you're calculating the exact match and F1 scores: since each question has multiple answers, are you taking the max EM and F1 over all the answers or averaging over all the answers?

Yes. We calculate exact match and F1 scores for each example by taking the max over all possible answers.
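
For reference, a minimal sketch of that scoring scheme, using the standard SQuAD answer normalization (this is just an illustration, not our actual evaluation code):

```python
import re
import string
from collections import Counter

def normalize(text):
    """Standard SQuAD normalization: lowercase, drop punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def f1(prediction, reference):
    """Token-level F1 between one prediction and one gold answer."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def score_example(prediction, gold_answers):
    """Per-example EM and F1: take the max over all gold answers."""
    em = max(float(normalize(prediction) == normalize(g)) for g in gold_answers)
    f1_max = max(f1(prediction, g) for g in gold_answers)
    return em, f1_max
```

The dataset-level numbers are then the averages of these per-example maxima.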

obhalerao97 commented 8 months ago

@tangbinh Thank you again for your response. Another question I had: from the SQuAD dataset, are you only evaluating on questions that have answers? If not, how do you evaluate questions where the answer is impossible? Are there any changes to the system prompt (the one mentioned in the paper) to account for this?

tangbinh commented 8 months ago

Can you please share your script?

@Sanchit-404 We run all benchmarks using our evaluation framework, so unfortunately it wouldn't be easy to share a script for SQuAD at this point. That said, we may consider open-sourcing the evaluation framework in the future.

Sanchit-404 commented 8 months ago

Thank you again for your response. Another question I had: from the SQuAD dataset, are you only evaluating on questions that have answers? If not, how do you evaluate questions where the answer is impossible? Are there any changes to the system prompt (the one mentioned in the paper) to account for this?

There are plausible answers attached to questions that are impossible to answer; I think those are deemed acceptable. But I have the same doubt.

Sanchit-404 commented 8 months ago

@obhalerao97 Can you share your evaluation script? I used the EleutherAI LM-eval framework, which gave only 32% accuracy.

obhalerao97 commented 8 months ago

@obhalerao97 Can you share your evaluation script? I used the EleutherAI LM-eval framework, which gave only 32% accuracy.

I would love to, but unfortunately the script cannot be shared since I don't directly own it.

tangbinh commented 8 months ago

Thank you again for your response. Another question I had: from the SQuAD dataset, are you only evaluating on questions that have answers? If not, how do you evaluate questions where the answer is impossible? Are there any changes to the system prompt (the one mentioned in the paper) to account for this?

We evaluate on all questions from SQuAD V2, including those that are not answerable. We use the exact same prompt from Ouyang et al. (2022).

Sanchit-404 commented 8 months ago

Thank you again for your response. Another question I had: from the SQuAD dataset, are you only evaluating on questions that have answers? If not, how do you evaluate questions where the answer is impossible? Are there any changes to the system prompt (the one mentioned in the paper) to account for this?

We evaluate on all questions from SQuAD V2, including those that are not answerable. We use the exact same prompt from Ouyang et al. (2022).

If I am not mistaken, you use the plausible_answers as gold for questions tagged is_impossible? If not, isn't that unfair either way without an explicit instruction not to answer, given there is no way to automatically evaluate questions that cannot be answered?
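
Alternatively, one way I could imagine scoring them (just my guess, not necessarily what your framework does) is to follow the official SQuAD v2 script: use the empty string as the gold answer for is_impossible questions, and map the prompted "Not in background." refusal to an empty prediction. Roughly:

```python
def gold_answers(example):
    """Gold references for one SQuAD v2 question (field names assume the raw SQuAD v2 JSON).
    The official SQuAD v2 script scores unanswerable questions against the empty string."""
    if example["is_impossible"]:
        return [""]
    return [a["text"] for a in example["answers"]]

def map_refusal(prediction):
    """My assumption: treat the prompted "Not in background." refusal as an empty prediction."""
    pred = prediction.strip()
    return "" if pred.lower().startswith("not in background") else pred
```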

obhalerao97 commented 8 months ago

Thank you again for your response. Another question I had: from the SQuAD dataset, are you only evaluating on questions that have answers? If not, how do you evaluate questions where the answer is impossible? Are there any changes to the system prompt (the one mentioned in the paper) to account for this?

We evaluate on all questions from SQuAD V2, including those that are not answerable. We use the exact same prompt from Ouyang et al. (2022).

@tangbinh I did try this implementation, but it does not give me minimum-span answers. For example, if the ground truth is "ten billion", the predicted text is "there were ten billion people". This seems to be the case for almost all questions, which in turn gives me an exact match score of 36-37%. My initial assumption was that the model was evaluated only on questions that had answers, which brought up my scores.

obhalerao97 commented 8 months ago

Thank you again for your response. Another question I had: from the SQuAD dataset, are you only evaluating on questions that have answers? If not, how do you evaluate questions where the answer is impossible? Are there any changes to the system prompt (the one mentioned in the paper) to account for this?

We evaluate on all questions from SQuAD V2, including those that are not answerable. We use the exact same prompt from Ouyang et al. (2022).

If I am not mistaken, you use the plausible_answers as gold for questions tagged is_impossible? If not, isn't that unfair either way without an explicit instruction not to answer, given there is no way to automatically evaluate questions that cannot be answered?

@Sanchit-404 refer to this paper (Figure 20) for the user prompt: https://arxiv.org/pdf/2203.02155.pdf

obhalerao97 commented 8 months ago

@tangbinh After multiple experiments with the suggested method, I still seem to be getting incorrect answers, and I'm not sure where I'm going wrong. (I'm using 70b-chat, where each input to the model is in the form of a dialog.)

An example prompt:

system_prompt: """You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."""

user_prompt: """Answer each question using information in the preceding background paragraph. If there is not enough information provided, answer with "Not in background."
Title: Southern_California
Background: "Southern California" is not a formal geographic designation, and definitions of what constitutes southern California vary. Geographically, California's north-south midway point lies at exactly 37° 9' 58.23" latitude, around 11 miles (18 km) south of San Jose; however, this does not coincide with popular use of the term. When the state is divided into two areas (northern and southern California), the term "southern California" usually refers to the ten southern-most counties of the state. This definition coincides neatly with the county lines at 35° 47′ 28″ north latitude, which form the northern borders of San Luis Obispo, Kern, and San Bernardino counties. Another definition for southern California uses Point Conception and the Tehachapi Mountains as the northern boundary.
Question: Geographically speaking, where is California's north - south midway point in terms of latitude? Answer: 37° 9' 58.23"
Question: How many miles south of San Jose is the north - south midway point located?
Answer: 11
Question: The term "southern" California usually refers to how many of the southern-most counties of the state?
Answer: ten
Question: Other than Point Conception, what landmark is used in the other definition of southern California?
Answer:
"""

Model output: Not in background. The background information does not mention any other landmark being used in the definition of southern California besides Point Conception.

Expected Output: Tehachapi Mountains

For almost all the questions, I get the "Model Output" shown above or something that does not make sense. I'm not sure how to solve this.
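
For completeness, this is roughly how I assemble the user prompt from the SQuAD data (a sketch; build_user_prompt and the (question, answer) pair list are my own names, just following the Figure 20 format):

```python
def build_user_prompt(title, background, previous_qas, question):
    """Figure-20-style prompt: instructions, paragraph, the preceding question/answer
    pairs from the same paragraph, then the target question with an empty Answer slot."""
    lines = [
        'Answer each question using information in the preceding background paragraph. '
        'If there is not enough information provided, answer with "Not in background."',
        f"Title: {title}",
        f"Background: {background}",
    ]
    for q, a in previous_qas:  # earlier (question, gold answer) pairs from this paragraph
        lines += [f"Question: {q}", f"Answer: {a}"]
    lines += [f"Question: {question}", "Answer:"]
    return "\n".join(lines)
```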

tangbinh commented 8 months ago

@obhalerao97 We reported numbers for the Llama pretrained models in the paper. We haven't tried the chat versions, but I think you might need to do more output filtering with them.

obhalerao97 commented 8 months ago

@tangbinh Even for the pretrained model (assuming you're using the text-completion format), the output looks like this: the model just generates more random questions and answers.

Model Output: Question: What is the other definition of southern California? Answer: Question: How many southern-most counties of California are there? Answer: 10 Question: What is the latitude of the north - south midway point? Answer:
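
The best I can do right now is a crude truncation heuristic before scoring (my own hack, not anything from the paper): cut the completion at the first newline or at the next generated "Question:".

```python
def extract_answer(completion, stops=("\n", "Question:", "Q:")):
    """Heuristic post-processing: keep only the text before the first newline or the
    next generated question. Not part of the official evaluation."""
    text = completion.strip()
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].strip()
```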

Sanchit-404 commented 8 months ago

@tangbinh After multiple experiments with the suggested method, I still seem to be getting incorrect answers, and I'm not sure where I'm going wrong.

Even I am facing the same issues. Adding the lines about model output for underspecified questions leads to performance degradation on factual prompts: the model either repeats the entire prompt or asks questions back. I believe this is because, whichever version of the models you use, they are effective only up to around 500 tokens. 4096 tokens is the upper bound they accept, but performance drops significantly beyond 500 tokens, with increased hallucinations even for very simple questions.

obhalerao97 commented 7 months ago

@tangbinh Thank you for your suggestion! I'm able to reproduce the SQuAD result mentioned in the paper. I was wondering if you could guide me on what prompt format was used for ARC and MMLU as well? That would be greatly appreciated.

Sanchit-404 commented 7 months ago

@tangbinh Thank you for your suggestion! I'm able to reproduce the SQuAD result mentioned in the paper. I was wondering if you could guide me on what prompt format was used for ARC and MMLU as well? That would be greatly appreciated.

What did you change from the last time?

obhalerao97 commented 7 months ago

@tangbinh Thank you for your suggestion! I'm able to reproduce the SQuAD result mentioned in the paper. I was wondering if you could guide me on what prompt format was used for ARC and MMLU as well? That would be greatly appreciated.

What did you change from the last time?

I followed the instructions and the paper @tangbinh referred to, and it worked perfectly. The paper has a GitHub repo linked that has examples of 0-shot SQuAD prompts; follow the same format.

Sanchit-404 commented 7 months ago

@obhalerao97 what was your performance on questions marked as is_impossible=True? I am getting a 0 F1 score using the template mentioned in the paper.

changyuying commented 5 months ago

@tangbinh Thank you for your response. Please correct me if I'm wrong, but the examples shown look like a 1-shot prompt, since we give a question and an answer before asking the actual question.

The paper uses the term "zero-shot" for this setting, so we simply followed that convention. I agree it's not precise, but "zero" here can be understood as the count of paragraph examples, not question-answer pairs. I think this setting is more relevant for dialog datasets such as QuAC, where information from preceding questions is required to answer the current one.

how you're calculating the exact match and F1 scores: since each question has multiple answers, are you taking the max EM and F1 over all the answers or averaging over all the answers?

Yes. We calculate exact match and F1 scores for each example by taking the max over all possible answers.

@tangbinh Hello, may I ask if the following prompt format is correct?

Title: Amazon_rainforest

Background: One computer model of future climate change caused by greenhouse gas emissions shows that the Amazon rainforest could become unsustainable under conditions of severely reduced rainfall and increased temperatures, leading to an almost complete loss of rainforest cover in the basin by 2100. However, simulations of Amazon basin climate change across many different models are not consistent in their estimation of any rainfall response, ranging from weak increases to strong decreases. The result indicates that the rainforest could be threatened though the 21st century by climate change in addition to deforestation.

Q: What change in conditions may make the Amazon rainforest unsustainable?

A: reduced rainfall and increased temperatures|severely reduced rainfall and increased temperatures

Q: A complete loss of rainforest cover may be caused by what type of emissions?

A: greenhouse gas emissions|greenhouse gas

Q: If one computer model turns out correct, by what year would there be a nearly complete loss of rainforest in the Amazon basin?

A: 2100|by 2100

Q: How long may the Amazon rainforest be threatened, according to some computer models?

A: though the 21st century

Q: What are the main threats facing the Amazon rainforest in the current century?

A: climate change in addition to deforestation

Q: Increased rainfall and decreased temperatures may make what unsustainable?

A: Not in background.

Q: A decrease in greenhouse gases may lead to a complete loss of what?

A: Not in background.

Q: Some computer models suggest the rain forest will become threatened after what?

A: 

Here is the output of the model:

2100|by 2100

Q: What is the result of simulations of Amazon basin climate change across many different models?

A: not consistent in their estimation of any rainfall response|ranging

Is it feasible to filter answers based solely on "2100|by 2100" for this form of answer? In addition, the "zero-shot" prompt for a question includes preceding questions and answers from the same paragraph; if the question is the first question under its paragraph, should the following questions be included in the prompt, or should the prompt only include that one question? Lastly, is it feasible for us to use the pretrained models for testing on SQuAD and make modifications directly based on the "example_text_completion.py" file in your project?
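
To make the question concrete, the kind of filtering I have in mind (my own sketch; the "|" separator only comes from how I joined the gold answers in the in-context examples above) is:

```python
def candidate_answers(completion):
    """Cut the completion at the next generated question, then split on "|"
    (the separator I used when joining gold answers in the prompt above)."""
    first_block = completion.strip().split("\n")[0]
    first_block = first_block.split("Q:")[0]
    return [c.strip() for c in first_block.split("|") if c.strip()]

# With the model output above:
# candidate_answers("2100|by 2100\n\nQ: What is the result ...") -> ["2100", "by 2100"]
```

Each candidate could then be scored against the gold answers, taking the max, but I'm not sure whether that matches your setup.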

changyuying commented 5 months ago

@tangbinh Thank you for your suggestion! I'm able to reproduce the SQuAD result mentioned in the paper. I was wondering if you could guide me on what prompt format was used for ARC and MMLU as well? That would be greatly appreciated.

Hello, I would like to ask: under the 0-shot setting, should the prompt include only the questions and answers that precede the current question within the same paragraph, or all of the paragraph's questions except the current one? For example, if the current question is the first question for its paragraph and there are no earlier questions, would performance be relatively poor for that question?

changyuying commented 5 months ago

@obhalerao97 @Sanchit-404 We use a prompt similar to what was described in Ouyang et al. (2022) where the "zero-shot" prompt for a question includes preceding questions and answers from the same paragraph (Figure 20).

Hello, I would also like to ask whether you use the entire validation set or filtered data when testing the Llama pretrained models on the SQuAD and QuAC datasets?