sierra-research / tau-bench

Code and Data for Tau-Bench

Concerns about scoring expected outputs #12

Open mattzh72 opened 4 days ago

mattzh72 commented 4 days ago

Firstly, thank you for contributing tau-bench!

I ran:

python3 run.py --agent-strategy tool-calling --env retail --model gpt-4o --model-provider openai --user-model gpt-4o --user-model-provider openai --user-strategy llm --max-concurrency 10 --task-ids 2

And got a perfectly scored output:

🏆 Average reward: 1.0
📈 Pass^k
  k=1: 1.0

The response itself, however, seems wrong. Digging deeper into the code, I see that the output is scored by this snippet in base.py:

        if len(self.task.outputs) > 0:
            # check outputs
            r_outputs = 1.0
            outputs = {}
            for output in self.task.outputs:
                found = False
                for action in self.actions:
                    if (
                        action.name == RESPOND_ACTION_NAME
                        and output.lower()
                        in action.kwargs["content"].lower().replace(",", "")
                    ):
                        found = True
                        break
                outputs[output] = found
                if not found:
                    r_outputs = 0.0
                    reward = 0.0
            info = RewardOutputInfo(r_outputs=r_outputs, outputs=outputs)

If I'm reading this correctly, it checks whether output.lower() (in this case, "10") appears anywhere in the agent's response. In my case, the agent responded:

"content": "We currently have 12 different T-shirt options available in our online store. Here are some of the options:\n\n1. **Color:** Blue, **Size:** M, **Material:** Cotton, **Style:** Crew Neck\n2. **Color:** Purple, **Size:** XL, **Material:** Cotton, **Style:** Crew Neck\n3. **Color:** Red, **Size:** XXL, **Material:** Cotton, **Style:** Crew Neck\n4. **Color:** Black, **Size:** XXL, **Material:** Polyester, **Style:** V-Neck\n5. **Color:** Black, **Size:** S, **Material:** Polyester, **Style:** Crew Neck\n6. **Color:** Purple, **Size:** S, **Material:** Polyester, **Style:** V-Neck\n7. **Color:** Blue, **Size:** S, **Material:** Cotton, **Style:** V-Neck\n8. **Color:** Black, **Size:** XXL, **Material:** Cotton, **Style:** Crew Neck\n9. **Color:** Red, **Size:** L, **Material:** Cotton, **Style:** V-Neck\n10. **Color:** Black, **Size:** XL, **Material:** Cotton, **Style:** Crew Neck\n\nPlease note that availability may vary, and some options might not be in stock at the moment.",

The response says there are 12 T-shirt options, yet it still passes the test case because it enumerates a list up to item 10. This seems problematic: either the expected output (10) is wrong, or the scoring is not very robust. Along the same lines, this check seems to reward more verbose models, since the odds of the expected string appearing somewhere in the response increase with length.
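
To make the failure concrete, here is a minimal, standalone reproduction of the substring comparison (the variable names and the trimmed response text are just for illustration):

    # Same comparison as in the base.py snippet above: case-insensitive
    # substring match against the agent's message with commas stripped.
    expected_output = "10"
    agent_response = (
        "We currently have 12 different T-shirt options available in our online store. "
        "Here are some of the options:\n"
        "...\n"
        "10. **Color:** Black, **Size:** XL, **Material:** Cotton, **Style:** Crew Neck"
    )

    found = expected_output.lower() in agent_response.lower().replace(",", "")
    print(found)  # True: the "10." prefix of the enumerated list item satisfies
                  # the check, even though the agent says there are 12 options.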

I attached my output below, thank you!

tool-calling-gpt-4o-0.0_range_0--1_user-gpt-4o-llm_1025151105.json

cc @noahshinn @ysymyth

noahshinn commented 4 days ago

Yeah, agreed. It's not the perfect semantic-equality check that we want. I'm happy to add an LLM check as an opt-in grading strategy, and I'm also open to feedback from you.
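
For what it's worth, here is a rough sketch of what that opt-in check could look like (purely illustrative; the function name, prompt wording, and grader model are placeholders, not a committed design):

    from openai import OpenAI

    client = OpenAI()

    def llm_output_check(expected_output: str, agent_response: str, model: str = "gpt-4o") -> bool:
        """Ask a grader model whether the agent's reply actually conveys the
        expected answer, instead of relying on a raw substring match."""
        resp = client.chat.completions.create(
            model=model,
            temperature=0.0,
            messages=[
                {
                    "role": "user",
                    "content": (
                        f"Expected answer: {expected_output}\n\n"
                        f"Agent response:\n{agent_response}\n\n"
                        "Does the agent response actually communicate the expected "
                        "answer (not merely contain it as a substring)? "
                        "Reply with only 'yes' or 'no'."
                    ),
                }
            ],
        )
        return resp.choices[0].message.content.strip().lower().startswith("yes")

With the T-shirt example above, a grader like this should flag the mismatch, since the response asserts there are 12 options while the expected answer is 10.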