hkust-nlp / AgentBoard

An Analytical Evaluation Board of Multi-turn LLM Agents

AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents

![Data License](https://img.shields.io/badge/Data%20License-GPL--2.0-blue.svg) ![Code License](https://img.shields.io/badge/Code%20License-Apache--2.0-blue.svg) ![Python 3.8+](https://img.shields.io/badge/python-3.8.13-blue.svg) [![slack badge](https://img.shields.io/badge/Slack-Join-blueviolet?logo=slack&)](https://join.slack.com/t/agentboard/shared_invite/zt-28ks1f1er-DzpwLKa41p_RArKnu2yimA)
🌐 Website | 🏆 Leaderboard | 📚 Data | 📃 Paper | 📊 Panel

What's New

Introduction

AgentBoard emphasizes analytical evaluation for Large Language Models (LLMs) as generalist agents to perceive and act within various environments. It outlines four principles for constructing a benchmark to evaluate LLMs as generalist agents:

  1. Task Diversity: AgentBoard incorporates 9 distinct tasks to comprehensively assess the generalist ability of LLM agents, which builds on LLMs' extensive knowledge base and exceptional scenario comprehension.
  2. Multi-round Interaction: AgentBoard provides multi-round interaction between agents and the environment, which is necessary to reflect the evolutionary nature of human intelligence: it continuously receives information and adapts to the environment.
  3. Partially-Observable Environments: In AgentBoard, the complete state of the environment is not available to the agent. This assesses the agent's world modeling ability, since additional knowledge must be acquired through online exploration.
  4. Analytical Evaluation: AgentBoard is a systematic evaluation platform: it includes a user-friendly script to construct goal-oriented reflex agents for a range of models, and features a panel for visualizing and interpreting results across multiple dimensions of agent proficiency, including fine-grained progress rates, grounding accuracy, performance breakdown for hard and easy examples, long-range interactions, detailed performance across various sub-skills, and trajectories with friendly visualization.

Table of Contents

- [What's New](#whats-new)
- [Introduction](#introduction)
- [🚀 Quick Start](#-quick-start)
  - [Setup Environment](#setup-environment)
  - [Evaluate Models](#evaluate-models)
  - [Launch AgentBoard Analytical Evaluation Panel](#launch-agentboard-analytical-evaluation-panel)
- [Data](#data)
  - [Data Overview](#data-overview)
  - [Download Link](#download-link)
- [Evaluation Details](#evaluation-details)
  - [Evaluation Preparation](#evaluation-preparation)
    - [Internet Access](#internet-access)
    - [Environment Preparation](#environment-preparation)
  - [Running Proprietary Models](#running-proprietary-models)
    - [For Tasks except WebShop](#for-tasks-except-webshop)
    - [For WebShop](#for-webshop)
  - [Running Open-source Models](#running-open-source-models)
  - [LLM Customization](#llm-customization)
  - [Agent Customization](#agent-customization)
  - [Runtime Estimation](#runtime-estimation)
- [Citation](#️citation)
- [License](#license)

🚀 Quick Start

Here we provide a quick start guide to evaluate LLM agents on AgentBoard within 30 minutes.

Setup Environment

We provide both local setup (recommended) and docker as follows:

Local setup procedures (~15 minutes). Setup with setup.sh:

**Step 1. Create a conda environment**

```shell
conda create -n ${YOUR_ENV_NAME} python=3.8.13  # python version should be 3.8.13
conda activate ${YOUR_ENV_NAME}
```

**Step 2. Git clone this repo**

```shell
git clone https://github.com/hkust-nlp/AgentBoard.git
```

**Step 3. Download the data from huggingface**

```shell
# Download the data and move it to the project root dir
cd AgentBoard
mkdir data
wget https://huggingface.co/datasets/hkust-nlp/agentboard/resolve/main/data.tar.gz
tar -zxvf data.tar.gz
```

**Step 4. Set up the environment for tasks except WebArena**

```shell
INSTALL_WEBARENA=false bash ./setup.sh

# After running the above command, the env will support all tasks other than WebArena
```

**Step 5. Set up the environment for WebArena**

```shell
# Please check whether dbus and Xvfb are installed before building it

# For Ubuntu or Debian
dpkg -l | grep dbus   # will return the info
systemctl status dbus # will return the status (active (running))
dpkg -l | grep xvfb   # will return the info

#-----------------------------------------------------------------------#

# For CentOS
yum list installed | grep Xvfb # will return the Xvfb info
systemctl status dbus          # will return the status (active (running))
dnf list installed | grep dbus # will return the dbus info
```

If so, you may install the WebArena environment directly.

```shell
INSTALL_WEBARENA=true bash ./setup.sh
```

If not, please jump to Step 6 or [Installation by Docker](#52-installation-by-docker).

**(Additional) Step 6. Install dbus and Xvfb**

```shell
# You must use the sudo permission to do the following:

# For Ubuntu or Debian
# Install and start the dbus service
apt-get install dbus
/etc/init.d/dbus start

# Install and start Xvfb
sudo apt-get update
sudo apt-get install xvfb

INSTALL_WEBARENA=true bash ./setup.sh

#--------------------------------------------------------#

# For CentOS
# Install and start the dbus service
yum install -y dbus-x11
/etc/init.d/dbus start

# Install and start Xvfb
yum update
yum install -y Xvfb

INSTALL_WEBARENA=true bash ./setup.sh
```
Docker setup procedures (~12G, 5 minutes). Docker info: CentOS

**Step 1. Pull the docker image and run docker locally**

```shell
docker pull zzh202121/agentboard:0117
docker run -itd \
    --gpus all \
    --network host \
    --name agent_space \
    --shm-size 64gb \
    -v /MODEL_PATH:/model_download \
    -v /DATA_PATH:/data \
    zzh202121/agentboard:0117 \
    /bin/bash
docker attach agent_space # YOUR_CONTAINER_NAME
```

**Step 2. Activate the env**

```shell
conda activate agentboard
```

**Step 3. Download the code and data**

```shell
git clone https://github.com/hkust-nlp/AgentBoard.git # clone repo
# Download the data and move it to the project root dir
cd AgentBoard
mkdir data
wget https://huggingface.co/datasets/hkust-nlp/agentboard/resolve/main/data.tar.gz
tar -zxvf data.tar.gz
```

**Step 4. Build search engine index (for WebShop)**

```shell
cd ./agentboard/environment/WebShop/search_engine
mkdir -p resources resources_100 resources_1k resources_100k
python convert_product_file_format.py # convert items.json => required doc format
mkdir -p indexes
./run_indexing.sh
cd ../../../.. # back to the project root (search_engine is four levels deep)
```

**Step 5. Start web service (for WebArena)**

```shell
/etc/init.d/dbus start           # start dbus
Xvfb :99 -screen 0 1280x720x24 & # start xvfb display
export DISPLAY=:99
python -m playwright install
```

Setup Environment Variables in AgentBoard/.env

Environment Variables needed for AgentBoard include:

```shell
PROJECT_PATH = {path to project}/AgentBoard

ANTHROPIC_API_KEY=...
OPENAI_API_KEY=...

TODO_KEY=...
MOVIE_KEY=...
SHEET_EMAIL=...

WANDB_API_KEY=...
```
API key setup procedures:

**Variables 1: API keys for Tool tasks**

Since API keys for **Tool** tasks are private, we do not provide them in this repo. Please follow this detailed [guide](./assets/api_keys_tool.md) to get API keys for **Tool** tasks.

**Variables 2: Weights & Biases key for AgentBoard Online Visualization**

Please copy your `WANDB_API_KEY` from [this page](https://wandb.ai/authorize) into the `.env` file to log in to Weights & Biases for AgentBoard visualization.

**Variables 3: API keys for Proprietary models**

**⚠️ You don't need to set up API keys for models you don't want to use.**

If you use OpenAI models, please put your API keys in the `.env` file.

```shell
OPENAI_API_TYPE="open_ai"
OPENAI_API_KEY=${YOUR_OPENAI_API_KEY}
```

If you use Anthropic models, please put your API keys in the `.env` file.

```shell
ANTHROPIC_API_KEY=${YOUR_ANTHROPIC_API_KEY}
```
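Before launching an evaluation, it can help to check which of the variables listed above are still unset. The following is a minimal sketch, not part of AgentBoard: the helper names `parse_env_text` and `missing_keys` are hypothetical, and it assumes the `.env` file uses plain `KEY=VALUE` lines. Which keys you actually need depends on the models and tasks you run.

```python
from pathlib import Path

# Variable names copied from the list above.
EXPECTED_KEYS = [
    "PROJECT_PATH",
    "ANTHROPIC_API_KEY",
    "OPENAI_API_KEY",
    "TODO_KEY",
    "MOVIE_KEY",
    "SHEET_EMAIL",
    "WANDB_API_KEY",
]


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Tolerate spaces around "=" and optional surrounding quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, expected=EXPECTED_KEYS):
    """Return the expected keys that are absent or empty, in list order."""
    return [key for key in expected if not values.get(key)]


if __name__ == "__main__":
    env_file = Path(".env")  # assumes the script runs from the AgentBoard root
    text = env_file.read_text() if env_file.exists() else ""
    print("Missing or empty:", missing_keys(parse_env_text(text)))
```

A key reported as missing is only a problem if the corresponding model or task is used.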

Evaluate Models

Example script for GPT-3.5-Turbo:

```shell
python agentboard/eval_main.py \
    --cfg-path eval_configs/main_results_all_tasks.yaml \
    --tasks alfworld \
    --model gpt-3.5-turbo-0613 \
    --wandb \
    --log_path ./results/gpt-3.5-turbo-0613 \
    --project_name evaluate-gpt-35-turbo-0613 \
    --baseline_dir ./data/baseline_results
```

We currently provide configurations for 12 SOTA LLMs (gpt-4, gpt-3.5-turbo-0613, text-davinci-003, claude2, deepseek-67b, lemur-70b, mistral-7b, codellama-13b(34b), llama2-13b(70b), vicuna-13b-16k) and a simple reflex agent based on act-only prompting. You can also customize your own agents and LLMs. Models supported by vLLM should generally work in AgentBoard, although different models may require specific prompt templates.

Launch AgentBoard Analytical Evaluation Panel

AgentBoard integrates Weights & Biases (W&B) visualization to help researchers analyze LLM agents systematically. To enable it, add the --wandb switch to the arguments and set the project_name and baseline_dir of your wandb project, as in the evaluation command above.

Before running, set up the wandb login or environment variable as instructed in Quick Start. Visualization results are stored offline at \wandb; normally, after executing the evaluation command, you can also view the live AgentBoard panel online at https://wandb.ai/{your_wandb_id}/{project_name}. We provide example WandB logging pages for GPT-4, GPT-3.5-Turbo, and DeepSeek-67b.

Note that if your run is not logged online (for example, on a cluster without internet access), you can later sync local runs to wandb with wandb sync [OPTIONS] [PATH]..., as detailed in the wandb docs. For more information about the features of the AgentBoard panel, please check this Blog.
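For clusters without internet access, the sync step can be prepared in advance. This is a rough sketch that only builds the `wandb sync PATH` commands and does not run them. It assumes that offline runs live in folders named `offline-run-*` inside the local wandb directory, which is wandb's usual naming but should be verified on your machine. The helper names are hypothetical.

```python
from pathlib import Path


def offline_run_dirs(wandb_dir):
    """Return offline run folders under wandb_dir, sorted by name."""
    root = Path(wandb_dir)
    if not root.is_dir():
        return []
    # Assumption: offline runs are stored as offline-run-<timestamp>-<id>.
    return sorted(p for p in root.glob("offline-run-*") if p.is_dir())


def sync_commands(wandb_dir="wandb"):
    """Build one `wandb sync PATH` command (as an argv list) per offline run."""
    return [["wandb", "sync", str(path)] for path in offline_run_dirs(wandb_dir)]


if __name__ == "__main__":
    for command in sync_commands():
        print(" ".join(command))
```

Each printed command can then be run from a machine with internet access, or passed to `subprocess.run`.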

Local log files

In addition to online results viewing, local logs are automatically stored in {log_path}. In WebArena, we additionally support more detailed trajectory files, including web page screenshots and network traffic records.

Log file organization:

```
{log_path}
├── logs                    # detailed example-wise logs for each task
│   ├── webarena_tracks     # WebArena provided rendered HTML files of the execution trace and a './trace' folder which is automatically generated with Playwright
│   │   ├── traces
│   │   │   ├── 102.zip
│   │   ├── render_102.html
│   │   ├── ...
│   ├── alfworld.jsonl      # each line is a json dictionary logging the statistics, trajectory, and prompt for each example
│   ├── babyai.jsonl
│   ├── ...
├── all_results.txt         # overall metrics for each task
├── dimension.txt           # agent capability dimensional scores for current LLM agent
├── alfworld.txt            # a general log for example-wise statistics for each task
├── babyai.txt
└── ...
```
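Because each line of a per-task `.jsonl` log is a JSON dictionary, the logs can be inspected with a few lines of Python. The sketch below makes no assumption about the field names inside each record; it only loads the records and counts which top-level keys appear. The helper names and the example path are illustrative.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Load one JSON object per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def key_counts(records):
    """Count how often each top-level key appears across the records."""
    counts = {}
    for record in records:
        for key in record:
            counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    # Illustrative path, following the log layout shown above.
    log_file = Path("results/gpt-3.5-turbo-0613/logs/alfworld.jsonl")
    if log_file.exists():
        records = read_jsonl(log_file)
        print(len(records), "examples; keys:", key_counts(records))
```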

Data

Data Overview

AgentBoard is composed of 9 diverse tasks which can be divided into 4 types, including Embodied AI, Game, Web, and Tool:

| Embodied AI | Game | Web | Tool |
| --- | --- | --- | --- |
| AlfWorld, ScienceWorld, BabyAI | Jericho, PDDL | WebShop, WebArena | Tool-Query, Tool-Operation |

To help researchers quickly understand the evaluation data of each task, we provide a Dataset Viewer at Huggingface Dataset: 🤗 AgentBoard.

Note: The data in the Dataset Viewer is not complete, so please download the dataset from the link provided below.

Download Link

You can download the whole evaluation data by running the following command:

```shell
wget https://huggingface.co/datasets/hkust-nlp/agentboard/resolve/main/data.tar.gz
```

Please uncompress the file and move the data to AgentBoard/data.

```shell
cd AgentBoard
mkdir data
tar -zxvf data.tar.gz
```

The file structure of evaluation data is as follows:

```
data
├── baseline_results
├── alfworld
│   ├── alfred.pddl # additional data for alfworld
│   ├── alfred.twl2 # additional data for alfworld
│   ├── json_2.1.1 # additional data for alfworld
│   └── test.jsonl
├── babyai
│   └── test.jsonl
├── jericho
│   ├── test.jsonl
│   └── z-machine-games-master  # additional data for jericho
├── pddl
│   └── test.jsonl
├── scienceworld
│   └── test.jsonl
├── tool-operation
│   └── test.jsonl
├── tool-query
│   ├── academia  # additional data for academia tool
│   └── test.jsonl
├── webarena
│   └── test.jsonl
└── webshop
    └── test.jsonl
```

**We also provide baseline run logs in `data/baseline_results`, which can be used for visualization in our panel.**
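After extracting the archive, a quick check that every task folder contains its `test.jsonl` can catch an incomplete download early. The sketch below takes the folder names from the file structure above. The function name `missing_test_files` is hypothetical, and the check covers only the `test.jsonl` files, not the additional per-task data.

```python
from pathlib import Path

# Task folders as listed in the file structure above.
TASK_DIRS = [
    "alfworld",
    "babyai",
    "jericho",
    "pddl",
    "scienceworld",
    "tool-operation",
    "tool-query",
    "webarena",
    "webshop",
]


def missing_test_files(data_root):
    """Return the task folders under data_root that lack a test.jsonl file."""
    root = Path(data_root)
    return [task for task in TASK_DIRS if not (root / task / "test.jsonl").is_file()]


if __name__ == "__main__":
    missing = missing_test_files("data")  # assumes the AgentBoard root as cwd
    print("All test files present" if not missing else f"Missing: {missing}")
```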

Evaluation Details

Evaluation Preparation

Internet Access

To evaluate the Tool-Query, Tool-Operation, and WebArena tasks, please make sure that the machine can access the Internet. This matters in particular in regions with Internet restrictions.

You can check whether you have network issues by observing the output during the execution process.

Environment Preparation

We provide two ways to install the environment of AgentBoard, as specified in Quick Start.

Running Proprietary Models

In this section, we provide a script to evaluate the closed-source models on each task.

Please do not forget to set the environment variables (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY) before running the following commands.

For Tasks except WebShop

We provide a quick start script to evaluate the gpt-3.5-turbo-0613 model on the alfworld task.

```shell
python agentboard/eval_main.py \
    --cfg-path eval_configs/main_results_all_tasks.yaml \
    --tasks alfworld \
    --model gpt-3.5-turbo-0613 \
    --wandb \
    --log_path ./results/gpt-3.5-turbo-0613 \
    --project_name evaluate-gpt-35-turbo-0613 \
    --baseline_dir ./data/baseline_results
```

Parameters:

For WebShop

First, please start the WebShop server by running the following commands:

```shell
cd ./agentboard/environment/WebShop
bash ./run_dev.sh
cd ../../..
```

Then, run the following command to evaluate the gpt-3.5-turbo-0613 model on the webshop task.

```shell
python agentboard/eval_main.py \
    --cfg-path eval_configs/main_results_all_tasks.yaml \
    --tasks webshop \
    --model gpt-3.5-turbo-0613 \
    --wandb \
    --log_path ./results/gpt-3.5-turbo-0613 \
    --project_name evaluate-gpt-35-turbo-0613 \
    --baseline_dir ./data/baseline_results
```

Running Open-source Models

AgentBoard pre-supports 8 open-source models: deepseek-67b, lemur-70b, mistral-7b, codellama-13b, codellama-34b, llama2-13b, llama2-70b, and vicuna-13b-16k. By default, we use vLLM to speed up inference.

Please refer to eval_configs/main_results_all_tasks.yaml for more details about these models.

To evaluate these models, you can run the following command:

```shell
python agentboard/eval_main.py \
    --cfg-path eval_configs/main_results_all_tasks.yaml \
    --tasks ${TASK_NAME} \
    --model ${OPEN_SOURCE_MODEL_NAME}
```
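When several tasks and models need to be evaluated, the command above can be generated programmatically. The sketch below uses only the flags that appear in this README and builds one command per task and model pair without running anything. The helper names are hypothetical, and the model and task names in the demo are assumptions: check `eval_configs/main_results_all_tasks.yaml` for the exact identifiers.

```python
import shlex


def build_eval_command(task, model,
                       cfg_path="eval_configs/main_results_all_tasks.yaml",
                       log_path=None):
    """Build the eval_main.py command as an argv list."""
    command = [
        "python", "agentboard/eval_main.py",
        "--cfg-path", cfg_path,
        "--tasks", task,
        "--model", model,
    ]
    if log_path is not None:
        command += ["--log_path", log_path]
    return command


def sweep(tasks, models):
    """One command per (model, task) pair, logging each model to its own folder."""
    return [
        build_eval_command(task, model, log_path=f"./results/{model}")
        for model in models
        for task in tasks
    ]


if __name__ == "__main__":
    # Names below are illustrative; the config file is the source of truth.
    for command in sweep(["alfworld", "webshop"], ["llama2-13b"]):
        print(shlex.join(command))
```

Printing the commands first makes it easy to review them before passing each list to `subprocess.run`.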

We also support LLM customization; please refer to LLM Customization for more details.

LLM Customization

Please refer to llm_customization.md for more details about LLM customization.

Agent Customization

Please refer to agent_customization.md for more details about agent customization.

Runtime Estimation

The evaluation runtime for a language model depends on the device/API, model, and inference architecture used. For open-source LLMs, vLLM inference is approximately 10 times faster than the Hugging Face pipeline.

To estimate the total time needed for evaluation, you can run a few steps to measure the inference speed (in seconds per round) and multiply it by the total number of LLM inferences, which is within 15,000 rounds.

Since 15,000 rounds at 1s/round take about 4 hours, the general formula for estimating the total time is 4h * speed, with speed measured in seconds per round. Here are some examples of our runtime:

| Model | Device/API | Inference Architecture | Inference Speed | Total-time |
| --- | --- | --- | --- | --- |
| GPT-4 | azure API | - | 1.5s/round | 5.5h |
| GPT-3.5-Turbo | azure API | - | 1s/round | 3h |
| DeepSeek-67b | 8*V100 | vllm | 5s/round | 18.5h |
| Llama2-70b | 8*V100 | vllm | 8s/round | 28h |
| Llama2-70b | 4*A100 | vllm | 4s/round | 13.5h |
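The estimate described above is a single multiplication, sketched here for convenience. The function name is hypothetical. Because 15,000 rounds is an upper bound on the number of LLM inferences, the result is closer to a ceiling than a prediction, which is consistent with the measured totals in the table being somewhat lower.

```python
def estimate_hours(seconds_per_round, rounds=15_000):
    """Upper-bound runtime in hours for a given inference speed."""
    return seconds_per_round * rounds / 3600


if __name__ == "__main__":
    for speed in (1.0, 1.5, 5.0):
        print(f"{speed}s/round -> at most {estimate_hours(speed):.2f}h")
```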

Citation

If you find this repository useful, please consider giving it a star and citing our paper:

@misc{ma2024agentboard,
      title={AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents}, 
      author={Chang Ma and Junlei Zhang and Zhihao Zhu and Cheng Yang and Yujiu Yang and Yaohui Jin and Zhenzhong Lan and Lingpeng Kong and Junxian He},
      year={2024},
      eprint={2401.13178},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

License

Apache-2.0 license

The AgentBoard codebase is licensed under an Apache-2.0 License.

GPL-2.0

The AgentBoard dataset is licensed under a GNU General Public License, version 2.