Improve the benchmark by evaluating multiple models and display the results

sotopia-lab / sotopia

Sotopia: an Open-ended Social Learning Environment (ICLR 2024 spotlight)

https://docs.sotopia.world

MIT License


Improve the benchmark by evaluating multiple models and display the results #126

Open bugsz opened 6 days ago

bugsz commented 6 days ago

Closes #

📑 Description

As the title suggests:

Support evaluating multiple models in a single run, e.g. `sotopia benchmark-all --model-list gpt-4o --model-list gpt-3.5-turbo`, or just go ahead with the default model names.
Support displaying and saving the results with `sotopia benchmark-display`, in the format used by https://github.com/sotopia-lab/sotopia-space/blob/main/data_dir/models_vs_gpt35.jsonl. (pandas does not seem to be a project requirement, so I am not sure how to display the results in a structured way in the CLI.)
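One possible way to approach both points without adding a pandas dependency is sketched below, using only the standard library. This is an illustration rather than the code in this PR: it uses `argparse` as a stand-in for the project's actual CLI framework, and the helper names (`parse_models`, `format_table`, `save_jsonl`), the default model list, and the row fields (`model_name`, `goal`) are all hypothetical. The real schema of `models_vs_gpt35.jsonl` should be checked before reusing any field name.

```python
import argparse
import json
from pathlib import Path

# Hypothetical defaults; the real default model names live in the sotopia CLI.
DEFAULT_MODELS = ["gpt-4o", "gpt-3.5-turbo"]


def parse_models(argv, default_models=DEFAULT_MODELS):
    """Collect repeated --model-list flags; fall back to the defaults if none are given."""
    parser = argparse.ArgumentParser(prog="benchmark-all")
    # action="append" gathers every occurrence of the flag into one list.
    parser.add_argument("--model-list", action="append", dest="models", default=None)
    args = parser.parse_args(argv)
    return args.models if args.models else list(default_models)


def format_table(rows):
    """Render a list of flat dicts as an aligned plain-text table."""
    if not rows:
        return ""
    columns = list(rows[0].keys())
    cells = [[str(row.get(col, "")) for col in columns] for row in rows]
    # Each column is as wide as its longest cell (or its header, if longer).
    widths = [
        max(len(col), *(len(line[i]) for line in cells))
        for i, col in enumerate(columns)
    ]

    def render(values):
        return "  ".join(v.ljust(w) for v, w in zip(values, widths)).rstrip()

    lines = [render(columns), render(["-" * w for w in widths])]
    lines.extend(render(line) for line in cells)
    return "\n".join(lines)


def save_jsonl(rows, path):
    """Write one JSON object per line, the usual JSONL layout."""
    with Path(path).open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


if __name__ == "__main__":
    models = parse_models(["--model-list", "gpt-4o", "--model-list", "gpt-3.5-turbo"])
    # Placeholder values only, not real benchmark results.
    rows = [{"model_name": name, "goal": 0.0} for name in models]
    print(format_table(rows))
    save_jsonl(rows, "benchmark_results.jsonl")
```

If a richer table is wanted later, a dedicated rendering library could replace `format_table` without touching the JSONL output.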

✅ Checks

[ ] My pull request adheres to the code style of this project
[ ] My code requires changes to the documentation
[ ] I have updated the documentation as required
[ ] All the tests have passed
[ ] Branch name follows type/descript (e.g. feature/add-llm-agents)
[ ] Ready for code review

ℹ Additional Information

codecov[bot] commented 6 days ago

Codecov Report

Attention: Patch coverage is 14.58333% with 41 lines in your changes missing coverage. Please review.

Project coverage is 61.03%. Comparing base (701f2a8) to head (8fb172a).

@@            Coverage Diff             @@
##             main     #126      +/-   ##
==========================================
- Coverage   61.71%   61.03%   -0.69%     
==========================================
  Files          55       55              
  Lines        2714     2756      +42     
==========================================
+ Hits         1675     1682       +7     
- Misses       1039     1074      +35

| Files | Coverage Δ | |
|---|---|---|
| sotopia/cli/benchmark/benchmark.py | `21.17% <14.58%> (-2.27%)` | :arrow_down: |

... and 1 file with indirect coverage changes