Platform • Paper • Dataset • Discord • Twitter • WeChat
Existing benchmarks for web agent tasks are either offline and static, or operate within a fully reproducible environment with limited Internet dynamics. The WebCanvas project aims to pioneer the online evaluation of web agents. Additionally, we offer a suite of toolkits for scaling and maintaining web agent data to support this endeavor. We welcome any constructive feedback on the project and look forward to partnering with you in developing agents for web tasks!
First, ensure your environment is ready by installing the necessary dependencies:
conda create -n webcanvas python=3.11
conda activate webcanvas
pip install -r requirements.txt
Before running the repo, you need to set up the required API keys to use features that depend on external APIs. Please refer to this documentation.
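For example, if your agent uses OpenAI models, you would typically export the corresponding key before running; the exact variable names the repo expects are listed in the documentation above, and OPENAI_API_KEY is shown here only as a common example:
# replace the placeholder with your actual key
export OPENAI_API_KEY=your_openai_api_key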
From our experiments, the experimental environment plays a crucial role in agent performance. We recommend experimenting on a Windows server with the Chrome or Firefox browser engine, preferably on servers located in the United States. Below are the experimental results on the Mind2Web-Live test set.
Planning Model | IP Region | System | Browser | Completion Rate | Task Success Rate | Efficiency Score |
---|---|---|---|---|---|---|
gpt-3.5-turbo-0125 | United States | Windows | Chrome | 40.2% | 16.5% | 3.03 |
gpt-3.5-turbo-0125 | United States | Windows | Firefox | 42.1% | 20.2% | 2.79 |
gpt-3.5-turbo-0125 | United States | Linux | Chrome | 36.5% | 15.4% | 3.33 |
gpt-3.5-turbo-0125 | United Kingdom | Windows | Chrome | 23.6% | 8.65% | 7.78 |
gpt-3.5-turbo-0125 | Singapore | Windows | Chrome | 42.3% | 21.2% | 2.95 |
Register on the platform here.
First, ensure your environment variables are correctly set so that the code can access the necessary credentials and URL.
export GRAPHQL_USERNAME=your_username
export GRAPHQL_PASSWORD=your_password
If you registered using your Google account, please set up a password on your profile page on iMean Builder.
To download a file, use the following command:
python data/dataset_io.py download \
--challenge-id your_challenge_id \
--save-path /path/to/save/file
- your_challenge_id: The ID of the challenge to download. For now, you can obtain this ID from the URL of the challenge page; for example, the ID of Mind2Web-Live Test is "WjVIjPfpa-psiltU3oD2W".
- /path/to/save/file: The path where the downloaded file will be saved.

The raw data contain rich step-level information to inspire future research; however, they are not used directly for our evaluation.
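For example, to download the Mind2Web-Live Test raw data (the save path and file name below are illustrative placeholders):
# download raw data for the Mind2Web-Live Test challenge
python data/dataset_io.py download \
--challenge-id WjVIjPfpa-psiltU3oD2W \
--save-path ./data/mind2web_live_test_raw.json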
To process the raw data, run the following command:
python data/raw_data_processor.py \
--input-file path/to/input/file \
--output-file path/to/output/file
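For instance, assuming the raw file downloaded above (both file names are placeholders):
# convert the raw export into the format used for evaluation
python data/raw_data_processor.py \
--input-file ./data/mind2web_live_test_raw.json \
--output-file ./data/mind2web_live_test.json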
You can run the evaluation with the following command:
python evaluate.py \
--global_reward_mode dom_reward \
--index -1 \
--single_task_name "Find Dota 2 game and add all DLC to cart in steam." \
--planning_text_model gpt-4o-mini \
--global_reward_text_model gpt-4o-mini
This command runs the script with DOM-based self-reward, processing the default task "Find Dota 2 game and add all DLC to cart in steam." at the default data index -1. It uses gpt-4o-mini for both planning and global reward reasoning. The evaluation mode is controlled by the task_mode parameter in configs/setting.toml, which lets you choose between batch mode and single mode (without automatic evaluation). Remember to specify the path to your test file in configs/setting.toml.
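For instance, to evaluate the whole test set rather than a single task, you would typically set task_mode to batch mode in configs/setting.toml (the exact key values are defined in that file; "batch" here is an assumption) and omit --single_task_name:
# batch evaluation over the test file configured in configs/setting.toml
python evaluate.py \
--global_reward_mode dom_reward \
--index -1 \
--planning_text_model gpt-4o-mini \
--global_reward_text_model gpt-4o-mini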
This program supports several command-line arguments to customize its behavior:
--global_reward_mode: Selects the method for computing global rewards. Options: dom_vision_reward, dom_reward, vision_reward, no_global_reward. Default: dom_reward.
- dom_vision_reward: Rewards are calculated using both DOM and vision data. Currently only GPT-4V is supported as the vision model.
- dom_reward: Rewards are based solely on DOM interactions. You can specify the language model used for reward reasoning via the global_reward_text_model parameter.
- vision_reward: Rewards are derived from vision-based interactions only. Currently only GPT-4V is supported as the vision model.
- no_global_reward: No global rewards are calculated.
--index: Decides which data index to start with. Default: -1. A range such as 0,5 will process data from index 0 to 5.
--single_task_name: Defines the task name of the single task to execute. Default: "Find Dota 2 game and add all DLC to cart in steam."
--planning_text_model: Specifies the model used for the planning module. Default: gpt-4o-mini.
--global_reward_text_model: Specifies the model used for global reward reasoning. Default: gpt-4o-mini.
Evaluating web agents in an online environment can sometimes be painful due to issues like network problems or bot detection on certain websites. Adopting an evaluation method that accommodates these issues allows for an accurate assessment of an agent's performance under the specific conditions at hand. Additionally, we provide a more flexible interaction mode that lets users manually resolve environmental issues and obtain the optimal performance of their web agents. You can enable this feature by setting the interaction_mode parameter in configs/setting.toml. In future versions we will continue building out error handling for online agent inference and aim to minimize human effort by triggering manual intervention only when exceptions occur.
IMPORTANT: You should upload the generated out.json file to participate in a challenge. To upload your result, use the following command:
python data/dataset_io.py upload \
--file-path /path/to/your/file \
--challenge-id your_challenge_id \
--name your_agent_name \
--base-model your_agent_base_model
Replace the placeholders with your actual values:
- /path/to/your/file: The path to the result file you want to upload.
- your_challenge_id: The ID of the challenge you want to participate in.
- your_agent_name: The agent name for the upload.
- your_agent_base_model: The agent base-model information for the upload.

You can also submit through our platform. We will conduct an official check on your submission to prevent cheating.
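For example, an upload to the Mind2Web-Live Test challenge might look like the following; the agent name and base-model strings are illustrative placeholders:
# submit the generated result file to the challenge leaderboard
python data/dataset_io.py upload \
--file-path ./out.json \
--challenge-id WjVIjPfpa-psiltU3oD2W \
--name my_web_agent \
--base-model gpt-4o-mini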
We provide a token consumption calculation functionality for evaluating the efficiency of your agent, and it is enabled automatically.
The token consumption is calculated based on the number of tokens consumed by the planning module and the global reward reasoning module (if applicable) during the evaluation process.
The token consumption calculation results of each experiment will be saved in the token_results
folder in JSON format.
We use the tiktoken
package to calculate token consumption. For models whose encodings cannot be obtained, the default encoding "cl100k_base" is used; therefore, token counts for non-OpenAI models may deviate somewhat.
Monetary cost is only calculated when the model name is provided under 'token_pricing' in setting.toml; otherwise, only the number of tokens is counted. If you want to calculate the monetary expenditure of a model not listed in 'token_pricing', first add the full name of the model (such as "gpt-4o-2024-05-13") to the 'pricing_models' list, then add the unit prices of input and output for this model below the list, for example "gpt-4o-2024-05-13_input_price = 0.000005" and "gpt-4o-2024-05-13_output_price = 0.000015".
A few example results on the Mind2Web-Live test set:
Planning Model | Completion Score | Task Success Rate | USD Efficiency Score |
---|---|---|---|
GPT-4o-2024-05-13 | 51.4% | 28.8% | 0.142 |
Llama-3.1-405B-Instruct-Turbo | 47.8% | 24.0% | 0.174 |
Llama-3.1-70B-Instruct-Turbo | 44.8% | 20.2% | 0.031 |
GPT-4o-mini-2024-07-18 | 42.9% | 21.2% | 0.004 |
GPT-3.5-turbo-0125 | 42.5% | 17.3% | 0.092 |
You can follow the instructions in this documentation to create your own challenging benchmark for web agents.
Currently, you need to set up a challenge (you can keep it private at first) on the WebCanvas Platform to download the raw data of your dataset.
If you are thinking of scaling your web trajectory data for training and evaluation, you can contact us directly for technical assistance.
Demo video:
We welcome contributions to WebCanvas!
Thank you for your interest in improving WebCanvas. Your contributions are greatly appreciated and essential to the growth and success of our project. Please refer to the roadmap and TODOs for promising directions.
We are building a vibrant and inclusive community around WebCanvas! Join our community to stay up-to-date with the latest developments and to contribute to the project:
We value your feedback and suggestions!
If you use this project in your research, please cite our paper:
@article{pan2024webcanvas,
title={WebCanvas: Benchmarking Web Agents in Online Environments},
author={Pan, Yichen and Kong, Dehan and Zhou, Sida and Cui, Cheng and Leng, Yifei and Jiang, Bing and Liu, Hangyu and Shang, Yanyi and Zhou, Shuyan and Wu, Tongshuang and others},
journal={arXiv preprint arXiv:2406.12373},
year={2024}
}