Closed bugsz closed 4 months ago
Is this a bug in sotopia, or did the LLM-based evaluator fail to follow the format instructions?
@bugsz can you confirm the above?
Sorry for the late reply. My answer is yes, but I think one possible cause is that the dimension prompt states:
"In the 'reasoning' field, provide a comprehensive account of the logic or thought process that led you to your conclusion." This may mislead the evaluator into providing the reasoning in a separate field.
Cool. Would you mind providing a test case?
I think you can just run a few episodes with random scenarios?
If you can provide a failed test case, we can test against it to make sure the output is stable.
See the branch bug/evaluate_single_dimension. I have provided one test case there that may reproduce the result (the input is longer so that it matches an actual Sotopia evaluation).
If you look at the raw output in agenerate, you can see that the reasoning field is separated.
Output on my side:
{"agent_1_evaluation": {"goal": ["Greet your friends and be polite", 5]},
 "agent_2_evaluation": {"goal": ["Be rude and dismissive to your friends", 10]}}
Reasoning:
Samuel Anderson's goal was to greet his friends and be polite. He did not
explicitly greet his friends, but he was polite in his responses. He thanked
Alice and wished her a great weekend. Therefore, he partially achieved his goal,
earning a score of 5.
Giselle Rousseau's goal was to be rude and dismissive to her friends. She was
rude in her response to Alice, using an offensive phrase. She also left the
conversation abruptly, which can be seen as dismissive. Therefore, she fully
achieved her goal, earning a score of 10.
The trailing "Reasoning:" part is not parsable.
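To illustrate why output shaped like the one above fails to parse, here is a minimal sketch using only Python's standard `json` module. It is not Sotopia's actual parser: `split_json_and_trailer` is a hypothetical helper, and `raw_output` is a shortened stand-in for the output I pasted. A strict `json.loads` rejects the whole string because of the text after the JSON object, while `JSONDecoder.raw_decode` reads the leading object and reports where it ends.

```python
import json

# Shortened stand-in for the raw evaluator output (assumed shape, not real logs).
raw_output = (
    '{"agent_1_evaluation": {"goal": ["Greet your friends and be polite", 5]}, '
    '"agent_2_evaluation": {"goal": ["Be rude and dismissive to your friends", 10]}}\n'
    "Reasoning:\n"
    "Samuel Anderson's goal was to greet his friends and be polite.\n"
)


def split_json_and_trailer(text: str) -> tuple[dict, str]:
    """Hypothetical helper: parse the leading JSON object and return it
    together with whatever text follows it."""
    stripped = text.lstrip()  # raw_decode does not skip leading whitespace
    obj, end = json.JSONDecoder().raw_decode(stripped)
    return obj, stripped[end:].strip()


# A strict parse fails because of the extra text after the JSON object.
try:
    json.loads(raw_output)
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False

parsed, trailer = split_json_and_trailer(raw_output)
print("strict parse succeeded:", strict_ok)
print("agent 1 goal score:", parsed["agent_1_evaluation"]["goal"][1])
print("unparsed trailer starts with:", trailer.splitlines()[0])
```

This only shows where the separate reasoning text ends up. Whether the parser should tolerate such a trailer or the prompt should be changed is a separate design question.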
I believe this is solved! @bugsz can confirm
Description of the bug
As stated in the title, when I use only the 'goal' dimension, I somehow get a separate 'reasoning' part in the raw output, which the parser does not parse. And so the results would be only an iteration of the
The expected format should contain only the "goal" part.
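As a rough sketch of what "only the goal part" could mean in a check, the snippet below validates a parsed evaluation against a set of expected dimensions. The function name `has_only_expected_dimensions` is hypothetical, and the assumption that each dimension maps to a `[reasoning, score]` pair is inferred from the output in this issue, not taken from Sotopia's code.

```python
def has_only_expected_dimensions(evaluation: dict, expected: set[str]) -> bool:
    """Hypothetical check: every agent's evaluation contains exactly the
    expected dimensions, each given as an assumed [reasoning, score] pair."""
    for agent_eval in evaluation.values():
        # A stray top-level field (e.g. a separate "reasoning" string) fails here.
        if not isinstance(agent_eval, dict) or set(agent_eval) != expected:
            return False
        for value in agent_eval.values():
            if not (
                isinstance(value, list)
                and len(value) == 2
                and isinstance(value[0], str)
                and isinstance(value[1], int)
            ):
                return False
    return True


# Shape taken from the output in this issue.
good = {
    "agent_1_evaluation": {"goal": ["Greet your friends and be polite", 5]},
    "agent_2_evaluation": {"goal": ["Be rude and dismissive to your friends", 10]},
}
# Illustrative bad case: reasoning emitted as a separate top-level field.
bad = dict(good, reasoning="Samuel Anderson's goal was to greet his friends.")

print(has_only_expected_dimensions(good, {"goal"}))
print(has_only_expected_dimensions(bad, {"goal"}))
```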
Steps To Reproduce
Comment out all the other dimensions except "goal" and run the evaluation.
Additional Information
[Unconfirmed] It seems the evaluation scores would also be different.