An example entry in the EVOUNA dataset is as follows:
{'question': 'Who won the World Professional Snooker Championship six times in the 1970s?',
'golden_answer': 'Ray Reardon',
'answer_fid': 'Ray Reardon',
'judge_fid': True,
'answer_gpt35': 'Steve Davis won the World Professional Snooker Championship six times in the 1970s.',
'judge_gpt35': False,
'answer_chatgpt': 'The snooker player who won the World Professional Snooker Championship six times in the 1970s is Ray Reardon.',
'judge_chatgpt': True,
'answer_gpt4': 'The individual who won the World Professional Snooker Championship six times in the 1970s is Ray Reardon.',
'judge_gpt4': True,
'answer_newbing': 'The Welsh snooker player\xa0Ray Reardon\xa0won the World Professional Snooker Championship six times in the 1970s. He was a dominant force in the world of snooker during that decade, and his success helped to popularize the sport.',
'judge_newbing': True,
'improper': False}
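Given entries shaped like the one above, a per-model correctness tally can be computed from the `judge_{x}` flags. This is a minimal sketch, not an official EVOUNA utility; the `entries` list and the model-suffix list are assumptions based on the keys shown in the example.

```python
from collections import Counter

def judged_correct_counts(entries):
    """Tally how often each model's answer was judged correct.

    `entries` is assumed to be a list of dicts with the keys shown
    in the example entry above (judge_fid, judge_gpt35, ...).
    """
    models = ["fid", "gpt35", "chatgpt", "gpt4", "newbing"]
    counts = Counter()
    for entry in entries:
        if entry.get("improper"):
            continue  # skip entries flagged as improper questions
        for m in models:
            if entry.get(f"judge_{m}"):
                counts[m] += 1
    return counts

# Usage with the example entry (non-judge keys omitted for brevity):
example = {
    "question": "Who won the World Professional Snooker Championship six times in the 1970s?",
    "golden_answer": "Ray Reardon",
    "judge_fid": True, "judge_gpt35": False, "judge_chatgpt": True,
    "judge_gpt4": True, "judge_newbing": True,
    "improper": False,
}
print(judged_correct_counts([example]))
# Counter({'fid': 1, 'chatgpt': 1, 'gpt4': 1, 'newbing': 1})
```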
Are the "judge_{x}" values human judgments? Or are they the result of prompting each model on its own output?