:page_facing_up: Paper | :mailbox: Contact
Introducing ToolTalk a benchmark for evaluating Tool LLMs in a conversational setting.
ToolTalk is designed to evaluate tool-augmented LLMs as a chatbot, an increasingly popular paradigm for everyday users to harness the power of LLMs. ToolTalk contains a handcrafted dataset of 28 easy conversations and 50 hard conversations. We annotate these conversations to contain a ground truth usage of 28 unique tools belonging to 7 themed "plugins".
Evaluation consists of prompting an LLM to predict the correct sequence of tools after every user utterance in a conversation. Thus, evaluating on a single conversation requires an LLM to correctly predict multiple sub-tasks. Predictions are compared against the ground truth to determine success for a single conversation.
We evaluate two chatbots on ToolTalk powered by gpt-3.5-turbo-0613 and gpt-4-0613 implemented by using the chat completions API from OpenAI.
Model | ToolTalk | Success rate | Precision | Recall | Incorrect Action Rate |
---|---|---|---|---|---|
GPT-3.5 | Easy | 85.7% | 42.4% | 89.3% | 5.0% |
GPT-4 | Easy | 92.8% | 69.2% | 96.4% | 3.8% |
GPT-3.5 | Hard | 26.0% | 54.6% | 69.7% | 23.9% |
GPT-4 | Hard | 50.0% | 74.9% | 79.0% | 25.1% |
ToolTalk can be setup using the following commands. Install local package with dev dependencies to enable unit tests.
pip install -r requirements.txt
pip install -e ".[dev]"
To verify that the installation was successful, run the unit tests.
pytest tests
The results on GPT-3.5-turbo and GPT-4 can be reproduced using the following commands. This requires having access to
OpenAI's API. The results will be saved in the results
folder. The script caches intermediary results, so it can be
re-run if it is interrupted for any reason.
export OPENAI_API_KEY=<your key>
bash evaluate_gpt35turbo.sh
bash evaluate_gpt4.sh
Your results should look something like the number above, there will be some variance due to both models having non-deterministic results.
To generate new scenarios, you can use the following command.
python -m tooltalk.generation.scenario_generator --prompt src/prompts/scenario_template.md --output_dir output/scenarios
The easiest way to evaluate on new models would be to create a new Predictor
class that inherits from tooltalk.evaluation.tool_executor.BaseAPIPredictor
.
For an example of how to do this, see tooltalk.evaluation.tool_executor.GPT3Predictor
and tooltalk.evaluation.evaluate_openai.OpenAIPredictor
.
@article{farn2023tooltalk,
title={ToolTalk: Evaluating Tool-Usage in a Conversation Setting},
author={Nicholas Farn and Richard Shin},
year={2023},
journal={arXiv preprint arXiv:2311.10775},
}
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.