skywalker023 / fantom

👻 Code and benchmark for our EMNLP 2023 paper - "FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions"
https://aclanthology.org/2023.emnlp-main.890/
MIT License

Some factQA questions have identical correct_answer and wrong_answer #3

Open · dniku opened this issue 3 months ago

dniku commented 3 months ago

Inspecting the dataset with:

wget https://storage.googleapis.com/ai2-mosaic-public/projects/fantom/fantom.tar.gz
tar xvf fantom.tar.gz
jq '.[117, 124, 141, 248, 295, 333, 339, 391, 458, 660, 683, 686, 763] | .factQA | {correct_answer, wrong_answer}' fantom_v1.json

I get:

{
  "correct_answer": "Italian cuisine.",
  "wrong_answer": "Italian cuisine."
}
{
  "correct_answer": "The conversation topic shifted to first date ideas and tips when Piper joined the group.",
  "wrong_answer": "The conversation topic shifted to first date ideas and tips when Piper joined the group."
}
{
  "correct_answer": "Victor",
  "wrong_answer": "Victor"
}
{
  "correct_answer": "The conversation shifted to the topic of cooking and their favourite dishes to prepare when Amari joined the discussion.",
  "wrong_answer": "The conversation shifted to the topic of cooking and their favourite dishes to prepare when Amari joined the discussion."
}
{
  "correct_answer": "Desmond owned the pet named Mittens and Remington owned the pet named Feathers.",
  "wrong_answer": "Desmond owned the pet named Mittens and Remington owned the pet named Feathers."
}
{
  "correct_answer": "No, the topic of influential figures in their understanding of feminism was not revisited in the conversation with Julius.",
  "wrong_answer": "No, the topic of influential figures in their understanding of feminism was not revisited in the conversation with Julius."
}
{
  "correct_answer": "The conversation shifted to running and cardio workouts after Aidan joined.",
  "wrong_answer": "The conversation shifted to running and cardio workouts after Aidan joined."
}
{
  "correct_answer": "They started discussing the concept of intersectionality after Juan's arrival.",
  "wrong_answer": "They started discussing the concept of intersectionality after Juan's arrival."
}
{
  "correct_answer": "Miguel",
  "wrong_answer": "Miguel"
}
{
  "correct_answer": "Jimmy's family emphasized the values of honesty and hard work.",
  "wrong_answer": "Jimmy's family emphasized the values of honesty and hard work."
}
{
  "correct_answer": "Yankees",
  "wrong_answer": "Yankees"
}
{
  "correct_answer": "Jett deals with diabetes.",
  "wrong_answer": "Jett deals with diabetes."
}
{
  "correct_answer": "Brian encountered a bear while hiking.",
  "wrong_answer": "Brian encountered a bear while hiking."
}

which means that some items have identical values for correct_answer and wrong_answer in their factQA field. Is this an error in the dataset?
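For reference, the indices above were collected by hand; a full scan can be done along these lines (a sketch assuming the top level of fantom_v1.json is an array and each item carries a factQA dict):

import json

# List every index whose factQA has identical correct_answer and wrong_answer.
with open("fantom_v1.json") as f:
    data = json.load(f)

dupes = [
    i for i, item in enumerate(data)
    if "factQA" in item
    and item["factQA"].get("correct_answer") is not None
    and item["factQA"].get("correct_answer") == item["factQA"].get("wrong_answer")
]
print(dupes)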

skywalker023 commented 3 months ago

Thanks for filing the issue! Can you please check whether these questions are from the inaccessible samples or the accessible samples?

dniku commented 3 months ago

Can you please check whether these questions are from the inaccessible samples or the accessible samples?

These are all fact questions, so I'm not sure how they can be inaccessible.

skywalker023 commented 3 months ago

Oh, the fact questions are only used to measure token F1 scores, so we only use the correct_answer!
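For concreteness, token F1 here is the usual SQuAD-style score. A minimal sketch with plain whitespace tokenization (our actual scorer may normalize punctuation and casing differently):

import collections

def token_f1(prediction, reference):
    # Token-level F1: harmonic mean of precision and recall over shared tokens.
    pred = prediction.lower().split()
    ref = reference.lower().split()
    common = collections.Counter(pred) & collections.Counter(ref)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred)
    recall = num_same / len(ref)
    return 2 * precision * recall / (precision + recall)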

dniku commented 3 months ago

Since you publish both correct_answer and wrong_answer, do you think it would be a good idea to make both valid?

skywalker023 commented 3 months ago

What do you mean by making them both valid? Can you please explain a bit more?

dniku commented 3 months ago

I mean ensuring that wrong_answer is genuinely an incorrect answer to the question. You currently publish this field as part of the dataset, but it is apparently not guaranteed to contain a string that would be an incorrect answer, given that in a few cases it equals correct_answer.

This also leads me to suspect that wrong_answer may not be an actual wrong answer in other cases as well, just not as trivially detectable as the ones I reported here. Would it be possible to verify that every string published as wrong_answer is in fact a wrong answer?

skywalker023 commented 3 months ago

Gotcha, I went through those instances and now I understand where the misunderstanding came from. Those correct_answer and wrong_answer values are the same because they come from accessible instances (i.e., conversations). Accessible instances are the ones with no information asymmetry regarding the fact question: the question can be answered the same way whether it's based on the part of the conversation in which character X was absent or the part in which X was involved. This is why those instances were labeled as accessible when we built the dataset.

So all fact questions in accessible instances actually have very similar correct_answer and wrong_answer values, and as you've reported, some are even identical. Hope this clears things up!

Maybe I should've made wrong_answer an empty string for those fact questions, or given the field a different name, to minimize misunderstanding. Sorry for the confusion 🙏🏻 Please let me know if there are other issues!
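In the meantime, downstream code can guard against this on its side by only using wrong_answer when it actually differs from correct_answer. A rough sketch (the helper name is just illustrative, not part of our loaders):

def usable_wrong_answer(fact_qa):
    # Return wrong_answer only when it differs from correct_answer,
    # i.e., when the instance carries real information asymmetry; else None.
    wrong = fact_qa.get("wrong_answer")
    if wrong and wrong != fact_qa.get("correct_answer"):
        return wrong
    return None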