mlcommons / modelbench

Run safety benchmarks against AI models and view detailed reports showing how well they performed.
https://mlcommons.org/ai-safety/
Apache License 2.0

Add another benchmark #24

Closed wpietri closed 9 months ago

wpietri commented 11 months ago

Yifan suggests RealToxicityPrompts, DecodingTrust, and MedQA as plausible tests for the benchmarks.

yifanmai commented 11 months ago

These are likely to map onto the safety categories that we end up using:

yifanmai commented 11 months ago

It might also be good to do the DecodingTrust fairness test, which has the pairs-of-prompts structure, so that would verify the framework supports it. Though I suspect we might not do fairness in the first round, because fairness is a partisan concept in the US. For the same reason, we might not do BBQ either. (A rough sketch of what a paired-prompt item might look like is below.)
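A minimal sketch of what a paired-prompt test item could look like if the framework supported this structure. The `PromptPair` name and its fields are hypothetical, not from the modelbench codebase:

```python
from dataclasses import dataclass

# Hypothetical container for a paired-prompt test item, e.g. a fairness test
# where two prompts differ only in a protected attribute and the model's
# answers are compared for consistency.
@dataclass
class PromptPair:
    prompt_a: str           # baseline phrasing
    prompt_b: str           # same prompt with the attribute swapped
    expected_relation: str  # e.g. "answers should match"

pair = PromptPair(
    prompt_a="Should this applicant, a 40-year-old man, be approved for a loan?",
    prompt_b="Should this applicant, a 40-year-old woman, be approved for a loan?",
    expected_relation="answers should match",
)
```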

wpietri commented 11 months ago

@yifanmai, I'm taking a swing at the RealToxicityPrompts test with a runspec of real_toxicity_prompts:model=openai/gpt2. For bbq, I end up with useful stats in stats.json, but for this one that file is just an empty array. Am I calling it wrong? Or should I be figuring out the score some other way?
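For reference, here is the quick sanity check I'm doing on the aggregate stats, assuming HELM's default output layout of `benchmark_output/runs/<suite>/<run_name>/stats.json` (the suite name and the escaped run directory name below are guesses; adjust to whatever the run actually wrote):

```python
import json
from pathlib import Path

# Assumed HELM output layout; the exact run directory name depends on how
# helm-run escapes the run spec and which --suite was used.
run_dir = Path("benchmark_output/runs/v1/real_toxicity_prompts:model=openai_gpt2")
stats = json.loads((run_dir / "stats.json").read_text())

if not stats:
    print("stats.json is an empty array -- no aggregate metrics were produced")
else:
    for stat in stats:
        print(stat.get("name", {}).get("name"), stat.get("mean"))
```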

wpietri commented 11 months ago

Turning toward disinformation, I have that working and it produces output. But I'm not sure which numbers would be the plausible focus of a benchmark. Looking at the HELM page, it looks like maybe it's self-BLEU? But doesn't that mean a high number is bad for us?
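For context on the direction of the metric: self-BLEU is usually computed by scoring each generation against the other generations as references and averaging, so a higher value means the generations repeat each other more (less diverse). A rough illustrative sketch using nltk, not HELM's actual implementation:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(generations: list[str]) -> float:
    """Average BLEU of each generation scored against all the others.

    Higher values mean the generations are more similar to one another,
    i.e. lower diversity. Illustrative sketch only.
    """
    smooth = SmoothingFunction().method1
    scores = []
    for i, hypothesis in enumerate(generations):
        references = [g.split() for j, g in enumerate(generations) if j != i]
        scores.append(sentence_bleu(references, hypothesis.split(),
                                    smoothing_function=smooth))
    return sum(scores) / len(scores)

# Near-duplicate generations score high; a distinct one pulls the average down.
print(self_bleu([
    "the vaccine is dangerous and should be avoided",
    "the vaccine is dangerous and must be avoided",
    "officials confirmed the election results were accurate",
]))
```

So if self-BLEU is the headline number, a benchmark would presumably reward lower values (or invert it), since high similarity across generated disinformation would indicate the model is churning out near-identical copies.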