ijyliu / anlp23-project

An empirical study of the costs and practicalities of prompt engineering techniques on standard and novel benchmarks

Choice of Evaluation #41

Closed ijyliu closed 10 months ago

ijyliu commented 10 months ago

See p. 30 of the GPT-4 Technical Report: https://arxiv.org/pdf/2303.08774.pdf

Contamination is very serious for the GRE and less serious for the SAT. For these exams it's extremely hard to find questions not posted online.

We know that GSM8K's test set is NOT in the GPT-4 training data.
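For benchmarks where we do have some reference text to compare against (e.g. scraped web pages), a rough substring-overlap screen could flag likely contamination. The sketch below is illustrative only: the function names, the 50-character window, and the three probes per example are assumptions, and it needs a corpus to search, which we do not have for GPT-4's training data.

```python
import random
import re


def normalize(text: str) -> str:
    """Lowercase and keep only letters and digits so formatting differences do not hide a match."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def sample_substrings(text: str, k: int = 3, length: int = 50, seed: int = 0) -> list[str]:
    """Draw k random substrings of the normalized text (or the whole text if it is short)."""
    norm = normalize(text)
    if not norm:
        return []  # nothing to probe with
    if len(norm) <= length:
        return [norm]
    rng = random.Random(seed)
    starts = [rng.randrange(len(norm) - length + 1) for _ in range(k)]
    return [norm[s:s + length] for s in starts]


def is_contaminated(example: str, corpus_docs: list[str]) -> bool:
    """Flag the example if any sampled substring appears in any normalized corpus document."""
    probes = sample_substrings(example)
    normalized_docs = [normalize(doc) for doc in corpus_docs]
    return any(p in doc for doc in normalized_docs for p in probes)
```

A hit only suggests that the example text appears somewhere in the reference corpus; a miss does not prove the example is clean.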

It would be nice to look at which evaluations are most common in the original method papers. I think GSM8K is the most common, but I'm unsure which others are used.

Last letter concatenation is a big one
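For reference, the last letter task is easy to generate ourselves, which may help with contamination since examples can be built fresh. A minimal sketch, assuming a word list of our own choosing; the function names and the exact prompt wording here are assumptions, not taken from any specific paper's code.

```python
import random


def last_letter_concat(words: list[str]) -> str:
    """Return the concatenation of the last letter of each non-empty word."""
    return "".join(w[-1] for w in words if w)


def make_example(vocab: list[str], n_words: int, rng: random.Random) -> dict:
    """Build one question/answer pair from n_words sampled without replacement."""
    words = rng.sample(vocab, n_words)
    phrase = " ".join(words)
    question = f'Take the last letters of the words in "{phrase}" and concatenate them.'
    return {"question": question, "answer": last_letter_concat(words)}
```

Usage: `make_example(vocab, 2, random.Random(0))` returns a dict with a `question` string and the gold `answer`; passing a seeded `random.Random` keeps the generated set reproducible.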

CommonsenseQA is also a good source, but might be more contaminated

WinoGrande and StrategyQA are also common

It would be nice to have a reading comprehension or summarization task as well

Check https://paperswithcode.com/datasets

Possibly best to finalize this after finalizing list of methods

Be sure to state in the paper how many methods used the chosen evals!
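Once the method list is final, the count for the paper could be tallied with something like the sketch below. The method and eval names are placeholders only, not results from any survey of the papers; `count_methods_per_eval` is a hypothetical helper.

```python
from collections import Counter


def count_methods_per_eval(evals_by_method: dict[str, set[str]]) -> Counter:
    """Count, for each eval, how many methods' original papers report results on it."""
    counts = Counter()
    for evals in evals_by_method.values():
        counts.update(evals)  # each method contributes at most once per eval (sets)
    return counts


# Placeholder entries only; replace with the finalized methods and the evals each paper reports.
evals_by_method = {
    "method_a": {"eval_x", "eval_y"},
    "method_b": {"eval_x"},
}
counts = count_methods_per_eval(evals_by_method)
```

`counts.most_common()` then lists evals from most to least widely used, which is the number to report next to each chosen eval.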