ICE-Score: Instructing Large Language Models to Evaluate Code

[EACL 2024] Paper: https://arxiv.org/abs/2304.14317
MIT License

January 2024 - ICE-Score has been accepted to EACL 2024 🎉🎉🎉


Example

Environment Setup

Our experiments are mainly built on the codegen-metrics and code-bert-score repositories. To replicate all experiments, please follow their instructions to set up the environment.

To run compute_results.ipynb and the modules in the llm-code-eval folder, use the following command to install all dependencies:

pip install -r requirements.txt

Folder Description

Usage

We provide a minimum viable product (MVP) for this project. To install it, please use the following command:

pip install -e .

You can use it to evaluate any generated code snippet, given the inputs problem, output, task, aspect, and model, as in the following example:

from llm_code_eval import evaluate

score = evaluate(problem="Given a list of integers, return the sum of all the integers.", 
                    output="sum = 0\nfor i in range(len(list)):\n\tsum += list[i]\nreturn sum", 
                    task="code-gen", aspect="usefulness", model="gpt-3.5-turbo")

print(score)
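
Note that evaluate() queries an OpenAI chat model (e.g. gpt-3.5-turbo or gpt-4), so a valid API key must be available before scoring. A minimal sketch, assuming the package relies on the standard OPENAI_API_KEY environment variable (check the repository if your setup differs):

import os

# Assumption (not stated in this README): the underlying OpenAI client reads
# the key from the standard OPENAI_API_KEY environment variable.
# Exporting it in your shell works as well.
os.environ["OPENAI_API_KEY"] = "sk-..."  # replace with your own key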

If you want to evaluate against reference code, you can use the reference option, as in the following example:

from llm_code_eval import evaluate

score = evaluate(problem="Given a list of integers, return the sum of all the integers.", 
                output="sum = 0\nfor i in range(len(list)):\n\tsum += list[i]\nreturn sum", 
                reference="sum = 0\nfor i in range(len(list)):\n\tsum += list[i]\nreturn sum", 
                task="code-gen", aspect="usefulness", model="gpt-3.5-turbo")

print(score)

You can also pass cot=True to enable zero-shot chain-of-thought evaluation, as in the following example:

from llm_code_eval import evaluate

score, eval_step = evaluate(problem="Given a list of integers, return the sum of all the integers.", 
                            output="sum = 0\nfor i in range(len(list)):\n\tsum += list[i]\nreturn sum", 
                            task="code-gen", aspect="usefulness", model="gpt-3.5-turbo", cot=True)

print(score)
print(eval_step)
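
Because evaluate() returns a score for each snippet, a typical use is comparing several candidate solutions to the same problem. Below is a minimal sketch that reuses only the evaluate() call shown above; the candidate list and the side-by-side printing are illustrative, not part of the package API:

from llm_code_eval import evaluate

problem = "Given a list of integers, return the sum of all the integers."

# Two hypothetical candidate solutions to score against each other.
candidates = [
    "sum = 0\nfor i in range(len(list)):\n\tsum += list[i]\nreturn sum",
    "def solve(numbers):\n\treturn sum(numbers)",
]

# Score every candidate on the same aspect with the same model.
scores = [
    evaluate(problem=problem, output=candidate,
             task="code-gen", aspect="usefulness", model="gpt-3.5-turbo")
    for candidate in candidates
]

# Print each candidate next to its score for comparison.
for candidate, score in zip(candidates, scores):
    print(f"score={score}\n{candidate}\n")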

Citation

@inproceedings{zhuo2024ice,
  title={ICE-Score: Instructing Large Language Models to Evaluate Code},
  author={Zhuo, Terry Yue},
  booktitle={Findings of the Association for Computational Linguistics: EACL 2024},
  pages={2232--2242},
  year={2024}
}

Acknowledgement

We thank JetBrains Research and NeuLab for their open-source code and data.