mlcommons / modelbench

Run safety benchmarks against AI models and view detailed reports showing how well they performed.

https://mlcommons.org/ai-safety/

Apache License 2.0

62 stars, 11 forks

issues

Bump the prod-deps group with 4 updates

#714 dependabot[bot] opened 9 hours ago
1
More runtime improvements

#713 wpietri closed 1 day ago
1
Tweak the grading function

#712 rogthefrog closed 2 days ago
1
More runtime improvements

#711 wpietri closed 3 days ago
1
Hopefully improve reliability and debugging output a bit.

#710 wpietri closed 3 days ago
1
New new grading function

#709 rogthefrog opened 3 days ago
1
Peter's emergency Mistral run

#708 wpietri opened 3 days ago
0
Register Phi 3.5 moe SUT + Add "instruct" to Phi UIDs

#707 bkorycki closed 3 days ago
2
Azure plugin + Phi 3.5 mini SUT

#706 bkorycki closed 4 days ago
1
Gemini safety settings on

#705 bkorycki closed 4 days ago
1
Update bands with Kurt's newest code

#704 rogthefrog opened 4 days ago
0
Hopefully final-er standards.

#703 wpietri closed 4 days ago
1
Fix annotator cache bug in benchmark runner

#702 bkorycki closed 5 days ago
1
Set up gemini with (safety on = BLOCK_LOW_AND_ABOVE)

#701 wpietri opened 5 days ago
0
Hopefully final standards

#700 wpietri closed 5 days ago
1
Register Llama 3.1 405b Instruct SUT

#699 bkorycki closed 6 days ago
1
Retry anthropic 429 more doggedly. Add a little more to the journal.

#698 wpietri closed 6 days ago
1
Bump the prod-deps group with 2 updates

#697 dependabot[bot] closed 6 days ago
1
Use actual heldback prompts in official tests

#696 bkorycki closed 1 week ago
1
Add new claude.

#695 wpietri closed 1 week ago
1
Bump aiohttp from 3.10.10 to 3.10.11 in the pip group

#694 dependabot[bot] closed 6 days ago
1
More consistent retrying

#693 wpietri closed 1 week ago
1
Include NVIDIA SUTs in benchmark

#692 wpietri opened 1 week ago
2
Apply ensemble updates nov 13

#691 bkorycki closed 1 week ago
1
Practice/heldback prompts switch

#690 bkorycki closed 1 week ago
1
Add more checks to consistency checker

#689 bkorycki opened 1 week ago
0
Stand up mistral SUT adapter for ministral 8b instruct

#688 rogthefrog opened 1 week ago
2
productionize gpt SUT

#687 wpietri opened 1 week ago
0
A Huggingface endpoint broke; this uses the replacement endpoint.

#686 wpietri closed 1 week ago
1
Bump the prod-deps group with 5 updates

#685 dependabot[bot] closed 1 week ago
1
Trying to get Dependabot to group the PRs in one lump on a weekly basis.

#684 wpietri closed 1 week ago
1
add nvidia-nim-api plugin to plugins/

#683 zijiachen95 closed 4 days ago
1
add nvidia-nim-api plugin to plugins/

#682 zijiachen95 closed 1 week ago
1
add nvidia-nim-api plugin to plugins/

#681 zijiachen95 closed 1 week ago
1
Journal consistency checker

#680 bkorycki closed 1 week ago
2
More elaborate private tests, saner public tests.

#679 wpietri closed 1 week ago
1
Add Microsoft Phi 3.5 MoE

#678 wpietri opened 1 week ago
0
Add Microsoft Phi 3.5 mini

#677 wpietri opened 1 week ago
0
Make official Llama SUT

#676 wpietri opened 1 week ago
0
Add proper anthropic SUT

#675 wpietri opened 1 week ago
0
Bump tqdm from 4.66.5 to 4.67.0

#674 dependabot[bot] closed 1 week ago
2
Bump tomli from 2.0.2 to 2.1.0

#673 dependabot[bot] closed 1 week ago
2
Add some basic journal documentation

#672 wpietri closed 2 weeks ago
1
Final final practice calibration (with ws3-llama-guard-3-ruby v0.3)

#671 wpietri closed 2 weeks ago
1
Practice prompt calibration

#670 wpietri closed 2 weeks ago
1
Update to latest ws3 voting strategy

#669 bkorycki closed 3 weeks ago
1
update grading function per October 2024 spec

#668 rogthefrog closed 1 week ago
4
Update ensemble join method

#667 bkorycki opened 3 weeks ago
0
Add persona and persona*hazard breakdown to benchmark grading functions

#666 rogthefrog opened 3 weeks ago
0
operational improvements, round 3

#665 wpietri opened 3 weeks ago
0