
DevBench: Towards LLM-based Automated Software Development

👋 Overview | 📖 Benchmarking | ⚙️ Setup | 🚀 Usage | 🔎 Citation | 📄 License

📬 Contact: libowen.ne@gmail.com, chao.peng@acm.org

📝 Check out our paper HERE!

👋 Overview

DevBench is a comprehensive benchmark for LLM-based automated software development. It evaluates models across the whole development lifecycle: software design, environment setup, implementation, acceptance testing, and unit testing.

📖 Benchmarking Code LLMs

Evaluation results of the coding tasks on DevBench.

<table>
  <tr>
    <th rowspan="2" style="text-align:center">Model</th>
    <th style="text-align:center">Environment Setup</th>
    <th colspan="2" style="text-align:center">Implementation</th>
    <th style="text-align:center">Acceptance Testing</th>
    <th colspan="2" style="text-align:center">Unit Testing</th>
  </tr>
  <tr>
    <th style="text-align:center">Pass@ Example Usage§</th>
    <th style="text-align:center">Pass@ Accept. Test¶</th>
    <th style="text-align:center">Pass@ Unit Test¶</th>
    <th style="text-align:center">Oracle Test§</th>
    <th style="text-align:center">Oracle Test§</th>
    <th style="text-align:center">Coverage$</th>
  </tr>
  <tr>
    <td style="text-align:center">GPT-3.5-Turbo</td>
    <td style="text-align:center"><em>33.3</em></td>
    <td style="text-align:center">4.2</td>
    <td style="text-align:center">4.3</td>
    <td style="text-align:center">11.7</td>
    <td style="text-align:center">28.7</td>
    <td style="text-align:center">24.6 (61.4)</td>
  </tr>
  <tr>
    <td style="text-align:center">GPT-4-Turbo-1106</td>
    <td style="text-align:center"><em>41.7</em></td>
    <td style="text-align:center">6.9</td>
    <td style="text-align:center">6.8</td>
    <td style="text-align:center">25.9</td>
    <td style="text-align:center">33.6</td>
    <td style="text-align:center">36.7 (66.7)</td>
  </tr>
  <tr>
    <td style="text-align:center">GPT-4-Turbo-0125</td>
    <td style="text-align:center"><em>41.7</em></td>
    <td style="text-align:center">7.1</td>
    <td style="text-align:center">8.0</td>
    <td style="text-align:center">29.2</td>
    <td style="text-align:center">36.5</td>
    <td style="text-align:center">33.2 (66.3)</td>
  </tr>
  <tr>
    <td style="text-align:center">CodeLlama-7B-Instruct</td>
    <td style="text-align:center"><em>8.3</em></td>
    <td style="text-align:center">0.0</td>
    <td style="text-align:center">0.0</td>
    <td style="text-align:center">0.0</td>
    <td style="text-align:center">3.0</td>
    <td style="text-align:center">3.6 (71.0)</td>
  </tr>
  <tr>
    <td style="text-align:center">CodeLlama-13B-Instruct</td>
    <td style="text-align:center"><em>25.0</em></td>
    <td style="text-align:center">0.6</td>
    <td style="text-align:center">0.0</td>
    <td style="text-align:center">0.0</td>
    <td style="text-align:center">5.1</td>
    <td style="text-align:center">8.6 (57.6)</td>
  </tr>
  <tr>
    <td style="text-align:center">CodeLlama-34B-Instruct</td>
    <td style="text-align:center"><em>16.7</em></td>
    <td style="text-align:center">0.6</td>
    <td style="text-align:center">0.5</td>
    <td style="text-align:center">4.5</td>
    <td style="text-align:center">21.1</td>
    <td style="text-align:center">25.4 (72.6)</td>
  </tr>
  <tr>
    <td style="text-align:center">DeepSeek-Coder-1.3B-Instruct</td>
    <td style="text-align:center"><em>8.3</em></td>
    <td style="text-align:center">0.0</td>
    <td style="text-align:center">0.1</td>
    <td style="text-align:center">0.0</td>
    <td style="text-align:center">5.6</td>
    <td style="text-align:center">2.7 (27.0)</td>
  </tr>
  <tr>
    <td style="text-align:center">DeepSeek-Coder-6.7B-Instruct</td>
    <td style="text-align:center"><em>25.0</em></td>
    <td style="text-align:center">2.9</td>
    <td style="text-align:center">3.9</td>
    <td style="text-align:center">20.5♡</td>
    <td style="text-align:center">23.5</td>
    <td style="text-align:center">28.2 (70.6)</td>
  </tr>
  <tr>
    <td style="text-align:center">DeepSeek-Coder-33B-Instruct</td>
    <td style="text-align:center"><em>16.7</em></td>
    <td style="text-align:center">4.4</td>
    <td style="text-align:center">5.5</td>
    <td style="text-align:center">13.6</td>
    <td style="text-align:center">32.8</td>
    <td style="text-align:center">35.7 (79.4)</td>
  </tr>
</table>

Italic figures: test cases for the Environment Setup task are scarce compared with the other tasks, so these results are more subject to randomness.
§: results are averaged across all repositories and weighted uniformly.
¶: results are averaged across all repositories and weighted by the number of code lines.
$: the left-hand results are averaged across all repositories and weighted uniformly, showing the overall scores; the parenthesized results are averaged uniformly across only the valid repositories, i.e. those for which the model generated executable testing code.
♡: the model generated meaningless but executable testing code.

Evaluation results of the software design on DevBench.

The code for the software design evaluation can be found here 👩🏽‍⚖️.

<table>
  <tr>
    <th rowspan="2" style="text-align:center">Model</th>
    <th colspan="2" style="text-align:center">w/ Tie</th>
    <th colspan="2" style="text-align:center">w/o Tie</th>
  </tr>
  <tr>
    <th style="text-align:center">General Principles†</th>
    <th style="text-align:center">Faithfulness‡</th>
    <th style="text-align:center">General Principles</th>
    <th style="text-align:center">Faithfulness</th>
  </tr>
  <tr><td style="text-align:center">GPT-4-Turbo-0125</td><td style="text-align:center">97.9</td><td style="text-align:center">97.9</td><td style="text-align:center">100.0</td><td style="text-align:center">100.0</td></tr>
  <tr><td style="text-align:center">GPT-4-Turbo-1106</td><td style="text-align:center">91.7</td><td style="text-align:center">85.4</td><td style="text-align:center">100.0</td><td style="text-align:center">100.0</td></tr>
  <tr><td style="text-align:center">CodeLlama-7B-Instruct</td><td style="text-align:center">4.2</td><td style="text-align:center">8.3</td><td style="text-align:center">4.2</td><td style="text-align:center">4.5</td></tr>
  <tr><td style="text-align:center">CodeLlama-13B-Instruct</td><td style="text-align:center">18.8</td><td style="text-align:center">14.6</td><td style="text-align:center">10.5</td><td style="text-align:center">5.3</td></tr>
  <tr><td style="text-align:center">CodeLlama-34B-Instruct</td><td style="text-align:center">39.6</td><td style="text-align:center">33.3</td><td style="text-align:center">33.3</td><td style="text-align:center">21.4</td></tr>
  <tr><td style="text-align:center">DeepSeek-Coder-1.3B-Instruct</td><td style="text-align:center">16.7</td><td style="text-align:center">16.7</td><td style="text-align:center">5.5</td><td style="text-align:center">5.6</td></tr>
  <tr><td style="text-align:center">DeepSeek-Coder-6.7B-Instruct</td><td style="text-align:center">35.4</td><td style="text-align:center">35.4</td><td style="text-align:center">31.6</td><td style="text-align:center">29.4</td></tr>
  <tr><td style="text-align:center">DeepSeek-Coder-33B-Instruct</td><td style="text-align:center">52.1</td><td style="text-align:center">50.0</td><td style="text-align:center">53.8</td><td style="text-align:center">50.0</td></tr>
  <tr><td style="text-align:center">Agree w/ Human Majority</td><td style="text-align:center">60.4</td><td style="text-align:center">51.6</td><td style="text-align:center">79.2</td><td style="text-align:center">83.2</td></tr>
</table>

Win rate of pairwise comparisons against GPT-3.5-Turbo on the Software Design task, evaluated on a subset of DevBench; results are averaged uniformly across repositories and sub-tasks. †: the general principles metric. ‡: the faithfulness metric. w/ Tie: inconsistent judgments are counted as a tie. We also report agreement with the human majority.

🐳 Set Up with Docker

For a secure and isolated environment, we offer Docker support for DevBench. Please refer to our detailed Installation Guide.
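
A typical containerized workflow might look like the following sketch. The image name devbench, the tag, and the mount layout are illustrative assumptions; the Installation Guide has the authoritative commands.

# Build an image from the repository root and open an isolated shell in it.
# "devbench" and /workspace are placeholders, not official names.
docker build -t devbench .
docker run -it --rm \
  -e OPENAI_API_KEY="$OPENAI_API_KEY" \
  -v "$(pwd)":/workspace -w /workspace \
  devbench bash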

🚀 Usage

1. Prepare the environment variables

Add your DevBench directory to your PYTHONPATH variable.

export PYTHONPATH="${PYTHONPATH}:${path_to_devbench}"

To run the benchmark_data/java/Actor_relationship_game repository, configure your TMDB API key.

export TMDB_API_KEY=${your_TMDB_key}
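
To sanity-check the key before running the benchmark, you can query the TMDB v3 API directly; this endpoint belongs to TMDB itself, not to DevBench.

# Should return a JSON configuration object; an error payload means the key is invalid.
curl "https://api.themoviedb.org/3/configuration?api_key=${TMDB_API_KEY}"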

2. Prepare the chat models

OpenAI GPT models

Set your OpenAI API key as an environment variable.

export OPENAI_API_KEY="your_OpenAI_API_key"
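
A quick way to verify the key is the standard OpenAI models endpoint (this is OpenAI's public REST API, not a DevBench script):

# Lists the models available to your key; a 401 response indicates a bad key.
curl https://api.openai.com/v1/models -H "Authorization: Bearer ${OPENAI_API_KEY}"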

Open source models

For deploying open source models, please refer to lmdeploy or vllm.
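
For example, a vLLM deployment exposing an OpenAI-compatible endpoint could look like the sketch below; the model path and port are assumptions, so consult the vLLM documentation for your setup.

# Serve the model behind an OpenAI-compatible HTTP API on port 8000.
python -m vllm.entrypoints.openai.api_server \
  --model deepseek-ai/deepseek-coder-6.7b-instruct \
  --host 0.0.0.0 --port 8000

The resulting address (e.g. http://<host>:8000) is what you then record in open_source_model.json, as described next.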

After the deployment, please configure the IP address in open_source_model.json.

For the codellama and deepseek-coder models, which are already integrated into our experiments, simply fill in the IP address as {"model_name": "$model_ip_address"}.

For example:

{
  "codellama-7b-instruct": "",
  "codellama-13b-instruct": "",
  "codellama-34b-instruct": "",
  "deepseek-coder-1.3b-instruct": "",
  "deepseek-coder-6.7b-instruct": "",
  "deepseek-coder-33b-instruct": "$model_ip_address"
}

For additional models, add a new field as shown below.

{
  "customized-model": {"$model_name": "$model_ip_address"}
}

3. Run the agent system

Run script

cd agent_system/baseline
python run.py --config Implementation --input_path ../../benchmark_data/python/TextCNN/ --model gpt-4-turbo-new --model_source openai --review execution --evaluate
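
A variant for a self-hosted model might look like the sketch below. The model name matches the open_source_model.json example above; the exact --model_source value accepted by run.py for open-source backends is an assumption here, and --review normal refers to the normal review mode described under Parameters.

# Hypothetical invocation for a self-hosted model; the --model_source value
# is an assumption, not confirmed by the repository docs.
python run.py --config Implementation \
  --input_path ../../benchmark_data/python/TextCNN/ \
  --model deepseek-coder-33b-instruct \
  --model_source open_source \
  --review normal --evaluate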

Parameters

When using normal review or execution review, the cyclenum parameter in CompanyConfig/{task_name}/ChatChainConfig.json specifies the number of review rounds. The default is 2.

🔎 Citation

@article{li2024devbench,
  title={DevBench: A Comprehensive Benchmark for Software Development},
  author={Li, Bowen and Wu, Wenhan and Tang, Ziwei and Shi, Lin and Yang, John and Li, Jinyang and Yao, Shunyu and Qian, Chen and Hui, Binyuan and Zhang, Qicheng and others},
  journal={arXiv preprint arXiv:2403.08604},
  year={2024}
}

📄 License

This project is released under the Apache License 2.0.