Choice of Evaluation

Contamination is very serious for the GRE and less serious for the SAT. For both exams, it is extremely hard to find questions that have not been posted online.

We know that GSM8K's test set is NOT in the GPT-4 training data.

It would be nice to look at which evaluations are most common in the original method papers. I think GSM8K is the most common, but it is unclear which others are used.

Last letter concatenation is a big one.

CommonsenseQA is also a good source, but might be more contaminated.

WinoGrande and StrategyQA are also common.

It would be nice to have a reading comprehension or summarization task as well.

It is possibly best to finalize this after finalizing the list of methods.

Be sure to state in the paper how many methods used the chosen evals!
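
Once the list of methods is final, the tally could be produced with something like the sketch below. It is only a rough illustration: the method names (`method_a`, etc.) and their benchmark lists are hypothetical placeholders, not survey results, and the helper names `count_eval_usage` and `coverage` are made up for this example.

```python
from collections import Counter

# HYPOTHETICAL placeholder data: method name -> benchmarks used in its original paper.
# Replace with the real survey once the list of methods is finalized.
method_evals = {
    "method_a": ["GSM8K", "CommonsenseQA", "StrategyQA"],
    "method_b": ["GSM8K", "Last Letter"],
    "method_c": ["GSM8K", "WinoGrande", "StrategyQA"],
}


def count_eval_usage(method_evals):
    """Count how many methods used each benchmark (each method counted once)."""
    counts = Counter()
    for evals in method_evals.values():
        counts.update(set(evals))  # set() guards against duplicate entries
    return counts


def coverage(method_evals, chosen):
    """For each chosen benchmark, report how many methods evaluated on it."""
    counts = count_eval_usage(method_evals)
    return {name: counts.get(name, 0) for name in chosen}


usage = count_eval_usage(method_evals)
for name, n in usage.most_common():
    print(f"{name}: {n}/{len(method_evals)} methods")
```

Calling `coverage(method_evals, chosen)` with the final list of evals gives the per-benchmark numbers to report in the paper.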

ijyliu / anlp23-project