Firstly, thank you for contributing tau-bench!

I ran the benchmark and got a perfectly scored output for a response that seems wrong. Digging deeper into the code, I see that it is being scored via this snippet in `base.py`:
```python
if len(self.task.outputs) > 0:
    # check outputs
    r_outputs = 1.0
    outputs = {}
    for output in self.task.outputs:
        found = False
        for action in self.actions:
            if (
                action.name == RESPOND_ACTION_NAME
                and output.lower()
                in action.kwargs["content"].lower().replace(",", "")
            ):
                found = True
                break
        outputs[output] = found
        if not found:
            r_outputs = 0.0
            reward = 0.0
    info = RewardOutputInfo(r_outputs=r_outputs, outputs=outputs)
```
If I'm reading correctly, it's checking whether `output.lower()` (in this case, `10`) appears somewhere in the response from the agent. In my case, the agent responded:
"content": "We currently have 12 different T-shirt options available in our online store. Here are some of the options:\n\n1. **Color:** Blue, **Size:** M, **Material:** Cotton, **Style:** Crew Neck\n2. **Color:** Purple, **Size:** XL, **Material:** Cotton, **Style:** Crew Neck\n3. **Color:** Red, **Size:** XXL, **Material:** Cotton, **Style:** Crew Neck\n4. **Color:** Black, **Size:** XXL, **Material:** Polyester, **Style:** V-Neck\n5. **Color:** Black, **Size:** S, **Material:** Polyester, **Style:** Crew Neck\n6. **Color:** Purple, **Size:** S, **Material:** Polyester, **Style:** V-Neck\n7. **Color:** Blue, **Size:** S, **Material:** Cotton, **Style:** V-Neck\n8. **Color:** Black, **Size:** XXL, **Material:** Cotton, **Style:** Crew Neck\n9. **Color:** Red, **Size:** L, **Material:** Cotton, **Style:** V-Neck\n10. **Color:** Black, **Size:** XL, **Material:** Cotton, **Style:** Crew Neck\n\nPlease note that availability may vary, and some options might not be in stock at the moment.",
The response says there are 12 T-shirt options, but it passes the test case because it enumerates a list up to 10. This seems problematic: either the expected output (`10`) is wrong, or the scoring is not very robust. On a similar train of thought, it also seems to reward more verbose models, since the odds of hitting the correct output string increase with response length.

I attached my output below, thank you!

tool-calling-gpt-4o-0.0_range_0--1_user-gpt-4o-llm_1025151105.json

cc @noahshinn @ysymyth
Yeah, agreed. It's not the perfect semantic equality check that we want. I'm happy to add an LLM check as an opt-in grading strategy, or I'm open to feedback from you as well.
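For discussion, a rough sketch of what an opt-in LLM-based check could look like is below; this is not tau-bench code, and the grader prompt, model name, and function name are placeholders:

```python
# Rough sketch of an opt-in LLM-based output check (not tau-bench's actual
# implementation; prompt, model name, and function name are placeholders).
from openai import OpenAI

client = OpenAI()

def llm_output_check(expected: str, response: str, model: str = "gpt-4o") -> bool:
    """Ask a grader model whether the agent's response actually conveys `expected`."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0.0,
        messages=[
            {
                "role": "user",
                "content": (
                    "An agent was expected to tell the user the value "
                    f"'{expected}'. Does the response below actually state that "
                    "value as its answer (not just mention the string in passing)? "
                    "Reply with only 'yes' or 'no'.\n\n"
                    f"Response:\n{response}"
                ),
            }
        ],
    )
    return completion.choices[0].message.content.strip().lower().startswith("yes")
```

Keeping it opt-in (e.g. behind a flag) would preserve the current deterministic default while letting users choose a stricter semantic check.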