wangcunxiang / QA-Eval

The repository for the paper <Evaluating Open-QA Evaluation>
Apache License 2.0

Question: Need clarification on dataset #3

Closed: luckysusanto closed this issue 1 month ago

luckysusanto commented 1 month ago

An example entry in the EVOUNA dataset is as follows:

{'question': 'Who won the World Professional Snooker Championship six times in the 1970s?',
  'golden_answer': 'Ray Reardon',
  'answer_fid': 'Ray Reardon',
  'judge_fid': True,
  'answer_gpt35': 'Steve Davis won the World Professional Snooker Championship six times in the 1970s.',
  'judge_gpt35': False,
  'answer_chatgpt': 'The snooker player who won the World Professional Snooker Championship six times in the 1970s is Ray Reardon.',
  'judge_chatgpt': True,
  'answer_gpt4': 'The individual who won the World Professional Snooker Championship six times in the 1970s is Ray Reardon.',
  'judge_gpt4': True,
  'answer_newbing': 'The Welsh snooker player\xa0Ray Reardon\xa0won the World Professional Snooker Championship six times in the 1970s. He was a dominant force in the world of snooker during that decade, and his success helped to popularize the sport.',
  'judge_newbing': True,
  'improper': False}
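
For context, here is a minimal sketch of how I'm reading the data. I'm assuming each split is a plain JSON list of entries shaped like the one above; the path `EVOUNA/NQ.json` is my guess, not something documented in this thread.

```python
import json

# Assumed layout: one JSON file per source dataset, each a list of dicts
# shaped like the example entry above. The path is a guess on my part.
with open("EVOUNA/NQ.json", encoding="utf-8") as f:
    data = json.load(f)

entry = data[0]
print(entry["question"])
for model in ("fid", "gpt35", "chatgpt", "gpt4", "newbing"):
    # Each model has a free-text answer plus a boolean correctness judgement.
    print(model, "->", entry[f"judge_{model}"], "|", entry[f"answer_{model}"][:60])
```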

Are the `judge_{x}` labels assigned by humans? Or are they the result of prompting each model on its own output?

wangcunxiang commented 1 month ago

By humans.

luckysusanto commented 1 month ago

Thanks for the swift reply! ^^
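
For anyone else who lands on this issue: since the `judge_{x}` flags are human labels, they can be used directly as ground truth. Here is a quick sketch for computing each model's human-judged accuracy, under the same file-layout assumption as above, plus my guess that `improper: True` flags questions that should be excluded.

```python
import json
from collections import defaultdict

MODELS = ("fid", "gpt35", "chatgpt", "gpt4", "newbing")

# Same assumption as above: the split is a JSON list of entries like the example.
with open("EVOUNA/NQ.json", encoding="utf-8") as f:
    data = json.load(f)

correct, total = defaultdict(int), defaultdict(int)
for entry in data:
    if entry.get("improper"):
        continue  # my guess: improper=True marks flawed questions to skip
    for model in MODELS:
        total[model] += 1
        correct[model] += bool(entry[f"judge_{model}"])

for model in MODELS:
    print(f"{model}: {correct[model] / total[model]:.1%} answers judged correct by humans")
```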