sotopia-lab / sotopia

Sotopia: an Open-ended Social Learning Environment (ICLR 2024 spotlight)
https://docs.sotopia.world
MIT License

Add a test case for a single dimension evaluation #123

Closed bugsz closed 12 hours ago

bugsz commented 1 week ago

📑 Description

This PR provides a test case for the issue mentioned in #89. Specifically, it adds a dummy evaluator with only one evaluation dimension (goal) and a new option for the response_format in the evaluator. The test also uses the same format as a real Sotopia simulation, which keeps the test case aligned with the actual evaluation.
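As a rough illustration of what a goal-only response format could look like, here is a minimal sketch. It is not the code in this PR: `GoalOnlyScore`, `parse_goal_only`, the JSON shape, and the 0-10 score range are all assumptions made for the example.

```python
import json
from dataclasses import dataclass


@dataclass
class GoalOnlyScore:
    """Hypothetical single-dimension result: free-text reasoning plus a score."""

    reasoning: str
    score: int


def parse_goal_only(raw: str) -> dict[str, GoalOnlyScore]:
    """Parse a reply shaped like {"agent_1": {"goal": [reasoning, score]}}.

    The JSON layout and the 0-10 range are assumptions for this sketch.
    """
    payload = json.loads(raw)
    result: dict[str, GoalOnlyScore] = {}
    for agent, dimensions in payload.items():
        # Only the "goal" dimension is expected; a missing key raises KeyError.
        reasoning, score = dimensions["goal"]
        score = int(score)
        if not 0 <= score <= 10:
            raise ValueError(f"goal score out of range for {agent}: {score}")
        result[agent] = GoalOnlyScore(reasoning=str(reasoning), score=score)
    return result
```

A test built on this shape can then assert on the reasoning text and the score separately instead of inspecting raw model output.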

bugsz commented 1 week ago

By the way, I am currently using assert False to see the output, so pytest is definitely not passing right now. However, I do not know how to check whether the output contains a reasoning part. Does anyone have an idea?
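One possible way to check for a reasoning part without forcing a failure is sketched below. The helper `has_reasoning`, its three-word threshold, and the sample (reasoning, score) tuple are hypothetical and not part of Sotopia; the printed output can be viewed by running pytest with `-s`, which disables output capturing.

```python
def has_reasoning(comment: str, min_words: int = 3) -> bool:
    """Return True if the text contains at least `min_words` words."""
    return len(comment.strip().split()) >= min_words


def test_goal_reasoning_present() -> None:
    # Hypothetical evaluator output: (reasoning, score) for the goal dimension.
    reasoning, score = ("The agent persuaded the partner to share the cost.", 7)
    # Visible with `pytest -s`; no need for `assert False` to see the output.
    print(reasoning)
    assert has_reasoning(reasoning)
    assert isinstance(score, int)
```

A word-count threshold is only a crude heuristic: it catches empty or score-only replies but says nothing about whether the reasoning is sensible.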

ProKil commented 1 week ago

@XuhuiZhou Could you help check this? I think this is basically a prompting issue. Changing the description of the goal dimension might make it work better.

XuhuiZhou commented 5 days ago

@bugsz @ProKil Okay, I fixed this bug. Basically, the original instructions were a bit ambiguous, but they somehow magically worked when they were all put together.

codecov[bot] commented 5 days ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 62.01%. Comparing base (8d9b9be) to head (269b4f2). Report is 2 commits behind head on main.

@@            Coverage Diff             @@
##             main     #123      +/-   ##
==========================================
+ Coverage   60.03%   62.01%   +1.98%     
==========================================
  Files          47       55       +8     
  Lines        2402     2733     +331     
==========================================
+ Hits         1442     1695     +253     
- Misses        960     1038      +78     
| Files | Coverage Δ |
|---|---|
| sotopia/envs/evaluators.py | 91.07% <100.00%> (+0.62%) :arrow_up: |
| tests/envs/test_evaluators.py | 100.00% <100.00%> (ø) |

... and 13 files with indirect coverage changes

ProKil commented 4 days ago

@bugsz Could you check if this fixes your problem?