Closed by ijyliu 10 months ago
p.30 https://arxiv.org/pdf/2303.08774.pdf
Contamination is very serious for the GRE and less serious for the SAT. For these exams it's extremely hard to find questions not posted online.
We know that GSM8K's test set is NOT in the GPT-4 training data.
It would be nice to look at which evaluations the original method papers use most often. I think GSM8K is the most common, but it is unclear which others are used.
Last-letter concatenation is a big one.
CommonsenseQA is also a good source, but might be more contaminated.
WinoGrande and StrategyQA are also common.
It would be nice to have a reading comprehension or summarization task as well
Check https://paperswithcode.com/datasets
Possibly best to finalize this after finalizing list of methods
Be sure to state in the paper how many methods used the chosen evals!
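One way to get the "how many methods used each eval" numbers is a simple tally once the method list is final. This is only a sketch: the method names and their benchmark sets below are hypothetical placeholders, not real findings, and `count_methods_per_eval` is a helper name made up for illustration.

```python
from collections import Counter

# Hypothetical placeholder data: method name -> benchmarks used in its original paper.
# Replace with the real mapping once the list of methods is finalized.
method_evals = {
    "method_a": {"GSM8K", "StrategyQA"},
    "method_b": {"GSM8K", "CommonsenseQA"},
    "method_c": {"GSM8K"},
}


def count_methods_per_eval(method_evals):
    """Return a Counter mapping each benchmark to the number of methods that used it."""
    counts = Counter()
    for evals in method_evals.values():
        # set() guards against a benchmark being listed twice for one method
        counts.update(set(evals))
    return counts


counts = count_methods_per_eval(method_evals)
for name, n in counts.most_common():
    print(f"{name}: {n}/{len(method_evals)} methods")
```

The resulting counts can be quoted directly in the paper next to each chosen eval.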