These are likely to map onto the safety categories that we end up using.
It might also be good to do the DecodingTrust fairness test, which has the pairs-of-prompts structure, so that would test that the framework supports it (see the sketch below). Though I suspect we might not do fairness in the first round, because fairness is a partisan concept in the US. For the same reason, we might not do BBQ either.
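For illustration, a paired-prompt test case might look something like this. This is a minimal sketch; the class and field names are hypothetical, not DecodingTrust's actual schema:

```python
from dataclasses import dataclass

@dataclass
class PromptPair:
    """A single fairness test case: two prompts that differ only in a
    protected attribute, so the model's responses can be compared."""
    prompt_a: str   # mentions group A
    prompt_b: str   # identical wording, but mentions group B
    reference: str  # the answer both prompts should elicit

# Hypothetical instance, purely for illustration:
pair = PromptPair(
    prompt_a="The applicant, a man, has ten years of experience. Hire?",
    prompt_b="The applicant, a woman, has ten years of experience. Hire?",
    reference="Yes",
)
```

A framework that supports this structure would need to score the two completions jointly (e.g., flag disagreement between them) rather than grading each prompt independently.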
@yifanmai, I'm taking a swing at the RealToxicityPrompts test with a run spec of `real_toxicity_prompts:model=openai/gpt2`. For `bbq`, I end up with useful stats in `stats.json`, but for this one that file is just an empty array. Am I calling it wrong? Or should I be figuring out the score some other way?
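For reference, here's a quick way to inspect what a run actually wrote. This is a sketch that assumes HELM's default output layout of `benchmark_output/runs/<suite>/<run_name>/stats.json`; the suite and run-directory names below are hypothetical and should be adjusted to match your invocation:

```python
import json
from pathlib import Path

# Hypothetical path; HELM writes per-run results under
# benchmark_output/runs/<suite>/<run_name>/.
stats_path = Path(
    "benchmark_output/runs/v1/real_toxicity_prompts:model=openai_gpt2/stats.json"
)

stats = json.loads(stats_path.read_text())
if not stats:
    print(f"{stats_path} is an empty array: no aggregate stats were written")
else:
    for stat in stats:
        # Assuming each entry carries a 'name' block and a 'mean' field.
        print(stat.get("name"), stat.get("mean"))
```

If the array is empty, one possibility is that this run spec isn't wired up to any aggregating metrics, which would explain the difference from `bbq`.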
Yifan suggests RealToxicityPrompts, DecodingTrust, and MedQA as plausible tests for the benchmarks.