👋 Overview | 📖 Benchmarking | ⚙️ Setup | 🚀 Usage | 🔎 Citation | 📄 License
📬 Contact: libowen.ne@gmail.com, chao.peng@acm.org
📝 Check out our paper HERE!
DevBench is a comprehensive benchmark designed to evaluate LLMs across various stages of the software development lifecycle, including software design, environment setup, implementation, acceptance testing, and unit testing. By integrating these interconnected steps under a single framework, DevBench offers a holistic perspective on the potential of LLMs for automated software development.
The DevBench dataset comprises 22 curated repositories across 4 programming languages (Python, C/C++, Java, JavaScript), covering a wide range of domains such as machine learning, databases, web services, and command-line utilities.
DevBench includes a comprehensive and automatic evaluation suite for all tasks involved. We provide extensive acceptance and unit test cases for the implementation task 🤗. Additionally, we utilize LLM-as-a-Judge for evaluating the software design task 👩🏽‍⚖️. Further details on our task specifications can be found here.
We have developed a baseline agent system based on the popular multi-agent software development system, ChatDev. Special thanks to our collaborators at ChatDev!
<table style="border-collapse:collapse;border:none;">
<tr>
<th style="text-align:center">Model</th>
<th style="text-align:center">Pass@ Example Usage§</th>
<th style="text-align:center">Pass@ Accept. Test¶</th>
<th style="text-align:center">Pass@ Unit Test¶</th>
<th style="text-align:center">Oracle Test§ (Accept. Testing)</th>
<th style="text-align:center">Oracle Test§ (Unit Testing)</th>
<th style="text-align:center">Coverage$</th>
</tr>
<tr>
<td style="text-align:center">GPT-3.5-Turbo</td>
<td style="text-align:center"><em>33.3</em></td>
<td style="text-align:center">4.2</td>
<td style="text-align:center">4.3</td>
<td style="text-align:center">11.7</td>
<td style="text-align:center">28.7</td>
<td style="text-align:center">24.6(61.4)</td>
</tr>
<tr>
<td style="text-align:center">GPT-4-Turbo-1106</td>
<td style="text-align:center"><em>41.7</em></td>
<td style="text-align:center">6.9</td>
<td style="text-align:center">6.8</td>
<td style="text-align:center">25.9</td>
<td style="text-align:center">33.6</td>
<td style="text-align:center">36.7(66.7)</td>
</tr>
<tr>
<td style="text-align:center">GPT-4-Turbo-0125</td>
<td style="text-align:center"><em>41.7</em></td>
<td style="text-align:center">7.1</td>
<td style="text-align:center">8.0</td>
<td style="text-align:center">29.2</td>
<td style="text-align:center">36.5</td>
<td style="text-align:center">33.2(66.3)</td>
</tr>
<tr>
<td style="text-align:center">CodeLlama-7B-Instruct</td>
<td style="text-align:center"><em>8.3</em></td>
<td style="text-align:center">0.0</td>
<td style="text-align:center">0.0</td>
<td style="text-align:center">0.0</td>
<td style="text-align:center">3.0</td>
<td style="text-align:center">3.6(71.0)</td>
</tr>
<tr>
<td style="text-align:center">CodeLlama-13B-Instruct</td>
<td style="text-align:center"><em>25.0</em></td>
<td style="text-align:center">0.6</td>
<td style="text-align:center">0.0</td>
<td style="text-align:center">0.0</td>
<td style="text-align:center">5.1</td>
<td style="text-align:center">8.6(57.6)</td>
</tr>
<tr>
<td style="text-align:center">CodeLlama-34B-Instruct</td>
<td style="text-align:center"><em>16.7</em></td>
<td style="text-align:center">0.6</td>
<td style="text-align:center">0.5</td>
<td style="text-align:center">4.5</td>
<td style="text-align:center">21.1</td>
<td style="text-align:center">25.4(72.6)</td>
</tr>
<tr>
<td style="text-align:center">DeepSeek-Coder-1.3B-Instruct</td>
<td style="text-align:center"><em>8.3</em></td>
<td style="text-align:center">0.0</td>
<td style="text-align:center">0.1</td>
<td style="text-align:center">0.0</td>
<td style="text-align:center">5.6</td>
<td style="text-align:center">2.7(27.0)</td>
</tr>
<tr>
<td style="text-align:center">DeepSeek-Coder-6.7B-Instruct</td>
<td style="text-align:center"><em>25.0</em></td>
<td style="text-align:center">2.9</td>
<td style="text-align:center">3.9</td>
<td style="text-align:center">20.5♡</td>
<td style="text-align:center">23.5</td>
<td style="text-align:center">28.2(70.6)</td>
</tr>
<tr>
<td style="text-align:center">DeepSeek-Coder-33B-Instruct</td>
<td style="text-align:center"><em>16.7</em></td>
<td style="text-align:center">4.4</td>
<td style="text-align:center">5.5</td>
<td style="text-align:center">13.6</td>
<td style="text-align:center">32.8</td>
<td style="text-align:center">35.7(79.4)</td>
</tr>
</table>
Italic figures: test cases for the Environment Setup task are scarce compared to the other tasks, so these results are more susceptible to randomness. §: results are averaged across all repositories with uniform weighting. ¶: results are averaged across all repositories, weighted by the number of code lines. $: the left-hand results are averaged uniformly across all repositories, showing the overall scores; the parenthesized right-hand results are averaged uniformly across only the valid repositories, i.e., those for which the model generated executable testing code. ♡: the model generated meaningless but executable testing code.
The code for the software design evaluation can be found here 👩🏽‍⚖️.
| Model | General Principles† (w/ Tie) | Faithfulness‡ (w/ Tie) | General Principles (w/o Tie) | Faithfulness (w/o Tie) |
|---|---|---|---|---|
| GPT-4-Turbo-0125 | 97.9 | 97.9 | 100.0 | 100.0 |
| GPT-4-Turbo-1106 | 91.7 | 85.4 | 100.0 | 100.0 |
| CodeLlama-7B-Instruct | 4.2 | 8.3 | 4.2 | 4.5 |
| CodeLlama-13B-Instruct | 18.8 | 14.6 | 10.5 | 5.3 |
| CodeLlama-34B-Instruct | 39.6 | 33.3 | 33.3 | 21.4 |
| DeepSeek-Coder-1.3B-Instruct | 16.7 | 16.7 | 5.5 | 5.6 |
| DeepSeek-Coder-6.7B-Instruct | 35.4 | 35.4 | 31.6 | 29.4 |
| DeepSeek-Coder-33B-Instruct | 52.1 | 50.0 | 53.8 | 50.0 |
| Agree w/ Human Majority | 60.4 | 51.6 | 79.2 | 83.2 |
Win rate of pairwise comparison against GPT-3.5-Turbo on Software Design, on a subset of DevBench; results are averaged uniformly across repositories and sub-tasks. †: the general principles metric. ‡: the faithfulness metric. w/ Tie: inconsistent results are counted as a tie. We also report agreement with the human majority.
For a secure and isolated environment, we offer Docker support for DevBench. Please refer to our detailed Installation Guide.
Add your DevBench directory to your `PYTHONPATH` environment variable.

```bash
export PYTHONPATH="${PYTHONPATH}:${path_to_devbench}"
```
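If you want the variable to persist across shell sessions, one option is to append the export to your shell profile (the DevBench path below is a placeholder; substitute your actual checkout location):

```bash
# Persist PYTHONPATH across sessions; /path/to/DevBench is a placeholder.
echo 'export PYTHONPATH="${PYTHONPATH}:/path/to/DevBench"' >> ~/.bashrc
source ~/.bashrc
```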
For running the `benchmark_data/java/Actor_relationship_game` repo, configure your TMDB API key.

```bash
export TMDB_API_KEY=${your_TMDB_key}
```
Set your OpenAI API key as an environment variable.

```bash
export OPENAI_API_KEY="your_OpenAI_API_key"
```
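As a quick sanity check that these variables are visible to child processes, something like the following works in bash (`TMDB_API_KEY` is only needed for the `Actor_relationship_game` repo):

```bash
# Warn about any required variable that is unset or empty.
# TMDB_API_KEY is only required for benchmark_data/java/Actor_relationship_game.
for var in PYTHONPATH OPENAI_API_KEY TMDB_API_KEY; do
  [ -z "${!var}" ] && echo "WARNING: $var is not set"
done
```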
For deploying open source models, please refer to lmdeploy or vllm.
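For instance, vLLM can serve a model behind an OpenAI-compatible endpoint. A minimal sketch follows; the model id, host, and port are illustrative, and flags may differ across vLLM versions:

```bash
# Serve deepseek-coder-33b-instruct via vLLM's OpenAI-compatible API server.
# Model id, host, and port are illustrative; adjust to your hardware and setup.
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/deepseek-coder-33b-instruct \
  --host 0.0.0.0 --port 8000
```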
After the deployment, please configure the IP address in `open_source_model.json`. For the codellama and deepseek-coder models, which are already integrated into our experiments, simply fill in the IP address as `{"model_name": $model_ip_address}`.
For example:
```json
{
  "codellama-7b-instruct": "",
  "codellama-13b-instruct": "",
  "codellama-34b-instruct": "",
  "deepseek-coder-1.3b-instruct": "",
  "deepseek-coder-6.7b-instruct": "",
  "deepseek-coder-33b-instruct": "$model_ip_address"
}
```
For additional models, add a new field as shown below.
```json
{
  "customized-model": {"$model_name": "$model_ip_address"}
}
```
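To verify that a configured endpoint is reachable, one option is to query it directly (this assumes the server exposes vLLM's OpenAI-compatible API; the address is a placeholder for the IP:port you entered in `open_source_model.json`):

```bash
# Placeholder address; use the IP:port from open_source_model.json.
curl http://127.0.0.1:8000/v1/models
```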
```bash
cd agent_system/baseline
python run.py --config Implementation --input_path ../../benchmark_data/python/TextCNN/ --model gpt-4-turbo-new --model_source openai --review execution --evaluate
```
- `--config`: the task to run, one of `SoftwareDesign` | `EnvironmentSetup` | `Implementation` | `AcceptanceTesting` | `UnitTesting`.
- `--input_path`: path to the target repository; the repository name is taken from the last path component of `input_path` (i.e., `input_path.split('/')[-1]`).
- `--model`: `gpt-3.5-turbo` | `gpt-4` | `gpt-4-32k` | `gpt-4-turbo` | `claude-2` | `claude-2.1` | `codellama-7b` | `codellama-13b` | `codellama-34b` | `deepseek-coder-1.3b` | `deepseek-coder-6.7b` | `deepseek-coder-33b` | `customized-model`. The entry configured in `open_source_model.json` is used when the `model` parameter is `customized-model`.
- `--model_source`: `open_source` | `openai`.
- `--review`: `none` | `normal` | `execution`.
  - `none`: a single forward pass of Coding.
  - `normal`: Coding and CodeReview in alternation, with CodeReview lacking program execution feedback.
  - `execution`: Coding and CodeReview in alternation, with CodeReview including program execution feedback.

When you use `normal` or `execution` review, the `cyclenum` parameter of `CompanyConfig/{task_name}/ChatChainConfig.json` can be specified as the number of rounds of review. The default is 2.
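Putting the flags together, here is a hedged end-to-end sketch that runs every task on the TextCNN repo. The flag values come from the list above; using `deepseek-coder-33b` assumes the corresponding endpoint has been filled in `open_source_model.json`:

```bash
# Run all five DevBench tasks on the TextCNN repo with an open-source model.
# Assumes open_source_model.json points at a live deepseek-coder-33b endpoint.
for task in SoftwareDesign EnvironmentSetup Implementation AcceptanceTesting UnitTesting; do
  python run.py --config "$task" \
    --input_path ../../benchmark_data/python/TextCNN/ \
    --model deepseek-coder-33b \
    --model_source open_source \
    --review none \
    --evaluate
done
```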
```bibtex
@article{li2024devbench,
  title={DevBench: A Comprehensive Benchmark for Software Development},
  author={Li, Bowen and Wu, Wenhan and Tang, Ziwei and Shi, Lin and Yang, John and Li, Jinyang and Yao, Shunyu and Qian, Chen and Hui, Binyuan and Zhang, Qicheng and others},
  journal={arXiv preprint arXiv:2403.08604},
  year={2024}
}
```