sotopia-lab / sotopia

Sotopia: an Open-ended Social Learning Environment (ICLR 2024 spotlight)
https://docs.sotopia.world
MIT License

Improve the benchmark by evaluating multiple models and displaying the results #126

Open bugsz opened 6 days ago

bugsz commented 6 days ago

Closes #

📑 Description

As the title suggests:

  1. Support evaluating multiple models in a single run by repeating the option, e.g. sotopia benchmark-all --model-list gpt-4o --model-list gpt-3.5-turbo, or omit the option to use the default model names.
  2. Support displaying and saving the results, via sotopia benchmark-display, in the format used by https://github.com/sotopia-lab/sotopia-space/blob/main/data_dir/models_vs_gpt35.jsonl. (pandas does not seem to be a project requirement, so I am not sure how best to display the results in a structured way in the CLI.)
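
For illustration, the two points above can be sketched without pandas. This is a minimal sketch, not the code in this PR: it uses argparse from the standard library as a stand-in for whatever CLI framework sotopia uses, and `DEFAULT_MODELS`, `parse_models`, `format_table`, `save_jsonl`, and the `model`/`score` field names are all hypothetical. The real schema of models_vs_gpt35.jsonl is not shown in this thread, and the scores below are placeholders, not benchmark results.

```python
import argparse
import json
from pathlib import Path

# Hypothetical defaults; the PR does not state the real default model names.
DEFAULT_MODELS = ["gpt-4o", "gpt-3.5-turbo"]


def parse_models(argv: list[str]) -> list[str]:
    """Collect every occurrence of --model-list into one list."""
    parser = argparse.ArgumentParser(prog="benchmark-all")
    # action="append" lets the same option be passed several times.
    parser.add_argument("--model-list", action="append", dest="models", default=None)
    args = parser.parse_args(argv)
    return args.models or list(DEFAULT_MODELS)


def format_table(rows: list[dict]) -> str:
    """Render rows as an aligned plain-text table (no pandas needed)."""
    if not rows:
        return ""
    headers = list(rows[0])
    cells = [[str(row.get(h, "")) for h in headers] for row in rows]
    # Each column is as wide as its longest header or cell.
    widths = [
        max(len(h), *(len(line[i]) for line in cells))
        for i, h in enumerate(headers)
    ]

    def fmt(values: list[str]) -> str:
        return " | ".join(v.ljust(w) for v, w in zip(values, widths))

    rule = "-+-".join("-" * w for w in widths)
    return "\n".join([fmt(headers), rule, *(fmt(line) for line in cells)])


def save_jsonl(rows: list[dict], path: str) -> None:
    """Write one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


if __name__ == "__main__":
    models = parse_models(["--model-list", "gpt-4o", "--model-list", "gpt-3.5-turbo"])
    # Placeholder scores only; real values would come from the benchmark run.
    rows = [{"model": m, "score": 0.0} for m in models]
    print(format_table(rows))
```

Keeping the table rendering as plain string formatting avoids adding a dependency just for display; the same rows can be passed to `save_jsonl` to produce the JSON Lines file.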


codecov[bot] commented 6 days ago

Codecov Report

Attention: Patch coverage is 14.58333% with 41 lines in your changes missing coverage. Please review.

Project coverage is 61.03%. Comparing base (701f2a8) to head (8fb172a).

@@            Coverage Diff             @@
##             main     #126      +/-   ##
==========================================
- Coverage   61.71%   61.03%   -0.69%     
==========================================
  Files          55       55              
  Lines        2714     2756      +42     
==========================================
+ Hits         1675     1682       +7     
- Misses       1039     1074      +35     
Files                                 Coverage Δ
sotopia/cli/benchmark/benchmark.py    21.17% <14.58%> (-2.27%) ⬇

... and 1 file with indirect coverage changes
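
The percentages in the report can be cross-checked against the hit and line counts in the diff table. The sketch below assumes coverage is simply hits divided by lines. The patch size of 48 lines is not stated in the report; it is inferred from the 14.58333% figure and the 41 uncovered lines. The helper name `percent` is hypothetical.

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


base = percent(1675, 2714)   # hits / lines on main
head = percent(1682, 2756)   # hits / lines on this PR
delta = head - base          # project coverage change

# Inferred, not reported: 41 missing lines at 14.58333% implies 48 patch lines.
patch_lines = 48
patch = percent(patch_lines - 41, patch_lines)

print(f"base={base:.3f}% head={head:.3f}% delta={delta:+.3f}% patch={patch:.5f}%")
```

The computed values agree with the reported 61.71%, 61.03%, -0.69% and 14.58333% to within display precision.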