
This is the repository for the Tool Learning survey: https://arxiv.org/abs/2405.17935

Survey: Tool Learning with Large Language Models

Recently, tool learning with large language models (LLMs) has emerged as a promising paradigm for augmenting the capabilities of LLMs to tackle highly complex problems.

This is a collection of papers related to tool learning with LLMs, organized according to our survey paper "Tool Learning with Large Language Models: A Survey".

Chinese introductions: We have noticed that PaperAgent and 旺知识 have provided a brief and a comprehensive introduction in Chinese, respectively. We greatly appreciate their assistance.

Please feel free to contact us if you have any questions or suggestions!

Contribution

:tada::+1: Please feel free to open an issue or make a pull request! :tada::+1:

Citation

If you find our work helpful for your research, please kindly cite our paper:

@article{qu2024toolsurvey,
    author={Qu, Changle and Dai, Sunhao and Wei, Xiaochi and Cai, Hengyi and Wang, Shuaiqiang and Yin, Dawei and Xu, Jun and Wen, Ji-Rong},
    title={Tool Learning with Large Language Models: A Survey},
    journal={arXiv preprint arXiv:2405.17935},
    year={2024}
}

📋 Contents

- 🌟 Introduction
- 📄 Paper List
- Challenges and Future Directions
- Other Resources

🌟 Introduction

Recently, tool learning with large language models (LLMs) has emerged as a promising paradigm for augmenting the capabilities of LLMs to tackle highly complex problems. Despite growing attention and rapid advancements in this field, the existing literature remains fragmented and lacks systematic organization, posing barriers to entry for newcomers. This gap motivates us to conduct a comprehensive survey of existing works on tool learning with LLMs. In this survey, we focus on reviewing existing literature from two primary aspects: (1) why tool learning is beneficial, and (2) how tool learning is implemented, enabling a comprehensive understanding of tool learning with LLMs. We first explore the “why” by reviewing both the benefits of tool integration and the inherent benefits of the tool learning paradigm from six specific aspects. In terms of “how”, we systematically review the literature according to a taxonomy of four key stages in the tool learning workflow: task planning, tool selection, tool calling, and response generation. Additionally, we provide a detailed summary of existing benchmarks and evaluation methods, categorizing them according to their relevance to different stages. Finally, we discuss current challenges and outline potential future directions, aiming to inspire both researchers and industrial developers to further explore this emerging and promising area.

Figure: The overall workflow for tool learning with large language models.
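To make the four-stage workflow concrete, here is a minimal sketch in Python. It is illustrative only and assumes nothing from any specific framework: `llm` stands in for any text-in, text-out completion call, `execute` for any tool dispatcher, and the `TOOLS` registry is a hypothetical example.

```python
# A minimal, hypothetical sketch of the four-stage tool learning workflow.
# `llm(prompt) -> str` and `execute(tool, args)` are placeholders supplied
# by the caller; they are not part of any real library's API.
import json

TOOLS = {
    "weather_api": 'Returns current weather for a city. Args: {"city": str}',
    "calculator": 'Evaluates an arithmetic expression. Args: {"expr": str}',
}

def task_planning(llm, query):
    """Stage 1: decompose the user query into sub-tasks."""
    plan = llm(f"Decompose into sub-tasks, one per line:\n{query}")
    return [line.strip() for line in plan.splitlines() if line.strip()]

def tool_selection(llm, sub_task):
    """Stage 2: pick the most relevant tool from the registry."""
    catalog = "\n".join(f"- {name}: {desc}" for name, desc in TOOLS.items())
    return llm(f"Task: {sub_task}\nTools:\n{catalog}\nReply with one tool name.").strip()

def tool_calling(llm, execute, sub_task, tool):
    """Stage 3: extract parameters and invoke the selected tool."""
    args = json.loads(llm(f"Give JSON arguments for {tool} ({TOOLS[tool]}) to solve: {sub_task}"))
    return execute(tool, args)

def response_generation(llm, query, observations):
    """Stage 4: synthesize tool results into a final natural-language answer."""
    return llm(f"Question: {query}\nTool results: {observations}\nAnswer:")

def run(llm, execute, query):
    observations = []
    for sub_task in task_planning(llm, query):
        tool = tool_selection(llm, sub_task)
        observations.append(tool_calling(llm, execute, sub_task, tool))
    return response_generation(llm, query, observations)
```

Real systems differ in how strictly they separate these stages (some interleave planning and calling, ReAct-style), but most of the papers below can be placed within this taxonomy.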

📄 Paper List

Why Tool Learning?

Benefit of Tools.

Benefit of Tool Learning.

How Tool Learning?

Task Planning.

Tool Selection.

Tool Calling.

Response Generation.

Benchmarks and Evaluation.

Benchmarks

| Benchmark | Reference | Description | #Tools | #Instances | Link | Release Time |
| --- | --- | --- | --- | --- | --- | --- |
| API-Bank | [Paper] | Assessing existing LLMs' capabilities in planning, retrieving, and calling APIs. | 73 | 314 | [Repo] | 2023-04 |
| APIBench | [Paper] | A comprehensive benchmark constructed from TorchHub, TensorHub, and HuggingFace API Model Cards. | 1,645 | 16,450 | [Repo] | 2023-05 |
| ToolBench1 | [Paper] | A tool manipulation benchmark consisting of diverse software tools for real-world tasks. | 232 | 2,746 | [Repo] | 2023-05 |
| ToolAlpaca | [Paper] | Evaluating the ability of LLMs to utilize previously unseen tools without specific training. | 426 | 3,938 | [Repo] | 2023-06 |
| RestBench | [Paper] | A high-quality benchmark consisting of two real-world scenarios and human-annotated instructions with gold solution paths. | 94 | 157 | [Repo] | 2023-06 |
| ToolBench2 | [Paper] | An instruction-tuning dataset for tool use, constructed automatically using ChatGPT. | 16,464 | 126,486 | [Repo] | 2023-07 |
| MetaTool | [Paper] | A benchmark designed to evaluate whether LLMs have tool-usage awareness and can correctly choose tools. | 199 | 21,127 | [Repo] | 2023-10 |
| TaskBench | [Paper] | A benchmark designed to evaluate the capability of LLMs from different aspects, including task decomposition, tool invocation, and parameter prediction. | 103 | 28,271 | [Repo] | 2023-11 |
| T-Eval | [Paper] | Evaluating tool-utilization capability step by step. | 15 | 533 | [Repo] | 2023-12 |
| ToolEyes | [Paper] | A fine-grained system tailored for evaluating LLMs' tool learning capabilities in authentic scenarios. | 568 | 382 | [Repo] | 2024-01 |
| UltraTool | [Paper] | A novel benchmark designed to improve and evaluate LLMs' ability in tool utilization within real-world scenarios. | 2,032 | 5,824 | [Repo] | 2024-01 |
| API-BLEND | [Paper] | A large corpus for training and systematic testing of tool-augmented LLMs. | - | 189,040 | - | 2024-02 |
| Seal-Tools | [Paper] | Contains hard instances that call multiple tools to complete the job, some of which are nested tool calls. | 4,076 | 14,076 | [Repo] | 2024-05 |
| ToolQA | [Paper] | Designed to faithfully evaluate LLMs' ability to use external tools for question answering. (QA) | 13 | 1,530 | [Repo] | 2023-06 |
| ToolEmu | [Paper] | A framework that uses an LM to emulate tool execution, enabling scalable testing of LM agents against a diverse range of tools and scenarios. (Safety) | 311 | 144 | [Repo] | 2023-09 |
| ToolTalk | [Paper] | A benchmark of complex user intents requiring multi-step tool usage specified through dialogue. (Conversation) | 28 | 78 | [Repo] | 2023-11 |
| VIoT | [Paper] | A benchmark including a training dataset and established performance metrics for 11 representative vision models, categorized into three groups using semi-automated annotations. (VIoT) | 11 | 1,841 | - | 2023-12 |
| RoTBench | [Paper] | A multi-level benchmark for evaluating the robustness of LLMs in tool learning. (Robustness) | 568 | 105 | [Repo] | 2024-01 |
| MLLM-Tool | [Paper] | A system incorporating open-source LLMs and multimodal encoders so that the learned LLMs are aware of multi-modal input instructions and can correctly select the function-matched tool. (Multi-modal) | 932 | 11,642 | [Repo] | 2024-01 |
| ToolSword | [Paper] | A comprehensive framework dedicated to meticulously investigating safety issues linked to LLMs in tool learning. (Safety) | 100 | 440 | [Repo] | 2024-02 |
| SciToolBench | [Paper] | Spanning five scientific domains to evaluate LLMs' abilities with tool assistance. (Sci-Reasoning) | 2,446 | 856 | - | 2024-02 |
| InjecAgent | [Paper] | A benchmark designed to assess the vulnerability of tool-integrated LLM agents to indirect prompt injection (IPI) attacks. (Safety) | 17 | 1,054 | [Repo] | 2024-02 |
| StableToolBench | [Paper] | A benchmark evolving from ToolBench, proposing a virtual API server and a stable evaluation system. (Stable) | 16,464 | 126,486 | [Repo] | 2024-03 |
| m&m's | [Paper] | A benchmark containing 4K+ multi-step multi-modal tasks involving 33 tools, including multi-modal models, public APIs, and image processing modules. (Multi-modal) | 33 | 4,427 | [Repo] | 2024-03 |
| GeoLLM-QA | [Paper] | A novel benchmark of 1,000 diverse tasks designed to capture complex remote sensing (RS) workflows in which LLMs handle complex data structures, nuanced reasoning, and interactions with dynamic user interfaces. (Remote Sensing) | 117 | 1,000 | - | 2024-04 |
| ToolLens | [Paper] | Includes concise yet intentionally multifaceted queries that better mimic real-world user interactions. (Tool Retrieval) | 464 | 18,770 | [Repo] | 2024-05 |
| SoAyBench | [Paper] | A solution-based LLM API-using methodology for academic information seeking. | 7 | 792 | [Repo], [HF] | 2024-05 |
| ShortcutsBench | [Paper] | A large-scale real-world benchmark for API-based agents. | 1,414 | 7,627 | [Repo] | 2024-07 |
| GTA | [Paper] | A benchmark for general tool agents. | 14 | 229 | [Repo] | 2024-07 |
| WTU-Eval | [Paper] | A whether-or-not tool usage evaluation benchmark for LLMs. | 4 | 916 | - | 2024-07 |
| AppWorld | [Paper] | A collection of complex everyday tasks requiring interactive coding with API calls. | 457 | 750 | [Repo] | 2024-07 |
| ToolSandbox | [Paper] | A stateful, conversational, and interactive tool-use benchmark. | 34 | 1,032 | [Repo] | 2024-08 |
| CToolEval | [Paper] | A benchmark designed to evaluate LLMs in the context of Chinese societal applications. | 27 | 398 | [Repo] | 2024-08 |
| NoisyToolBench | [Paper] | Includes a collection of provided APIs, ambiguous queries, anticipated clarifying questions, and the corresponding responses. | - | 200 | - | 2024-09 |
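Benchmark formats vary, but many of the datasets above reduce to (instruction, gold tool calls) pairs, so evaluation harnesses tend to share a common shape. The sketch below is a rough illustration only: the JSON fields `query` and `gold_tools` are invented for this example and do not match the schema of any particular benchmark listed here.

```python
# Hypothetical scoring loop for a tool-selection benchmark.
# The input schema (a JSON list of {"query": str, "gold_tools": [str]})
# is invented for illustration; each real benchmark defines its own format.
import json

def evaluate(predict, path):
    """`predict(query) -> set[str]` is the tool-selection system under test."""
    with open(path) as f:
        instances = json.load(f)
    exact, recall_sum = 0, 0.0
    for ex in instances:
        gold = set(ex["gold_tools"])
        pred = predict(ex["query"])
        exact += int(pred == gold)                              # exact set match
        recall_sum += len(pred & gold) / len(gold) if gold else 1.0  # tool recall
    n = len(instances)
    return {"exact_match": exact / n, "tool_recall": recall_sum / n}
```

Metrics also differ per stage: task-planning benchmarks often score decomposition quality, tool-calling benchmarks check parameter correctness or execution success, and end-to-end benchmarks judge the final response.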

Evaluation

Challenges and Future Directions

Other Resources