This is a fork of the HumanEval repository, modified to benchmark LLMs in the bio-image analysis domain. The original HumanEval repository is an evaluation harness for the HumanEval problem-solving dataset described in the paper "Evaluating Large Language Models Trained on Code".
Using the benchmark in this repository, we compared 15 LLMs in their capability to generate bio-image analysis Python code. For this purpose, we defined the test cases listed here, which can be used to evaluate the functional correctness of bio-image analysis code. The pass-rate shown in this plot expresses the probability that generated code passes the given unit tests:
Furthermore, we visualize the observed pass-rate per task:
To find out more, please read our preprint.
Feedback is welcome, e.g. as a GitHub issue.
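The pass-rate mentioned above can be illustrated with the unbiased pass@k estimator from the HumanEval paper. The following is a minimal sketch, assuming n generated samples per task of which c pass all unit tests; the function name `pass_at_k` and the example numbers are illustrative and not taken from this repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    n: number of samples generated for a task
    c: number of those samples that passed all unit tests
    k: number of samples considered per attempt (k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # 1 - C(n - c, k) / C(n, k): one minus the chance that all k drawn samples fail
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical example: 10 samples for one task, 3 of them pass
print(pass_at_k(n=10, c=3, k=1))
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass.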
Make sure to use Python 3.10 or later:
$ mamba create --name heb python=3.10
$ conda activate heb
Check out and install this repository:
$ git clone https://github.com/haesleinhuepf/human-eval-bia.git
$ cd human-eval-bia
$ pip install -e .
$ pip install -r requirements.txt
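After installation, a quick way to confirm that the active environment meets the Python 3.10 requirement is a small check like the one below. This is only a sketch; the helper name `python_is_recent_enough` is hypothetical and not part of this repository.

```python
import sys

# Minimum version stated in the installation instructions above
MIN_VERSION = (3, 10)


def python_is_recent_enough(version_info=sys.version_info, minimum=MIN_VERSION) -> bool:
    """Return True if the interpreter version is at least the given minimum."""
    return tuple(version_info[:2]) >= minimum


if python_is_recent_enough():
    print("Python version OK:", sys.version.split()[0])
else:
    print("Python 3.10 or later is required, found:", sys.version.split()[0])
```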
To run the benchmark for OpenAI-based models, please create an OpenAI API Key as explained in this blog post.
To run the benchmark for Google Gemini models, you need to create a Google Cloud account here and a project within Google Cloud (for billing) here. You need to store authentication details locally as explained here, which requires installing the Google Cloud CLI. In short: run the installer and, when asked, activate the "Run gcloud init" checkbox, or run "gcloud init" from the terminal yourself. Restart the terminal window. After installing the Google Cloud CLI, start a terminal and authenticate using:
gcloud auth application-default login
Follow the instructions in the browser. Enter your Project ID (not the name).
To run the benchmark for the models accessible via Helmholtz' blablador service, which is free for German academics, get an API key as explained on this page and store it in your environment as the BLABLADOR_API_KEY variable.
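Before starting a benchmark run, it can help to check that the API keys are visible to Python. The following is a minimal sketch: BLABLADOR_API_KEY is the variable named above, OPENAI_API_KEY is an assumption here (it is the variable the openai Python package reads by default), and the helper name `missing_api_keys` is hypothetical.

```python
import os


def missing_api_keys(names, environ=None):
    """Return the names of environment variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


# BLABLADOR_API_KEY comes from the text above; OPENAI_API_KEY is an assumption
required = ["BLABLADOR_API_KEY", "OPENAI_API_KEY"]
missing = missing_api_keys(required)
if missing:
    print("Missing API keys:", ", ".join(missing))
else:
    print("All API keys found.")
```

Adjust the `required` list to the services you actually plan to benchmark.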
To run the benchmark for locally running models, install Ollama. We used Ollama 0.1.29 for Windows (preview).
This program exists to run untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. The execution call in execution.py is deliberately commented out to ensure users read this disclaimer before running code in a potentially unsafe manner. See the comment in execution.py for more information and instructions.
> [!CAUTION]
> Also note that executing the benchmark with commercial models such as ChatGPT or Gemini can cause substantial costs.
To reproduce our benchmarks, you can go through the notebooks provided in the /notebooks directory.
You can add new test cases by adding new notebooks to the /notebooks/human-eval-bia directory. Check out the examples there and make sure to stick to the following rules.
> [!CAUTION]
> Most importantly: When writing new test-case notebooks, do not use language models for code generation. You would otherwise bias the benchmark towards this model. Use human-written code only and/or examples from the documentation of specific libraries.
The notebooks must have the following format:
One code cell must contain the function to be tested, including a docstring that describes what it does. For example:
def sum(a, b):
    """
    This function computes the sum of two numbers.
    """
    return a + b
Another code cell must define def check(candidate): and contain test code to test the generated code. The test code must use assert statements and call the candidate function. E.g. if a given function to test is sum, then a valid test for sum would be:
def check(candidate):
    assert candidate(3, 4) == 7
A further code cell must call the check function with your custom function, e.g. like this, to prove that the code you provided works with the tests you wrote:
check(sum)
Save the notebook under a fitting file name, e.g. sum.ipynb.
We aim to collect all Python libraries that LLMs are capable of using in the bio-image analysis context in the requirements.txt file. Additionally, for documentation purposes, we want to record the environment in which the last evaluation was executed. Thus, the environment.yml file also needs to be updated, in particular when requirements change. If the new test case requires specific Python libraries to be installed, please add them to requirements.txt. Also update the environment.yml file using this command:
conda env export > environment.yml
Submit both files together with your pull-request. That way we can see how the environment changes when merging a pull-request.
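To illustrate the notebook rules above, the following sketch puts the three code cells into one plain Python script and then packs them into a record using the field names of the original HumanEval dataset (task_id, prompt, canonical_solution, test, entry_point). How this repository actually converts notebooks is not shown here; the task_id value and the way the function source is split are assumptions made for illustration.

```python
import json

# Cell 1: the function to be tested, with a meaningful docstring
FUNCTION_SOURCE = '''def sum(a, b):
    """
    This function computes the sum of two numbers.
    """
    return a + b
'''

# Cell 2: the test code, using assert statements on the candidate
TEST_SOURCE = '''def check(candidate):
    assert candidate(3, 4) == 7
'''

# Split the function after the docstring: everything up to and including the
# closing triple quotes is the prompt, the remainder is the reference solution.
split_at = FUNCTION_SOURCE.rindex('"""') + len('"""\n')
record = {
    "task_id": "example/sum",  # hypothetical identifier
    "prompt": FUNCTION_SOURCE[:split_at],
    "canonical_solution": FUNCTION_SOURCE[split_at:],
    "test": TEST_SOURCE,
    "entry_point": "sum",
}

# One JSON object per line is what makes a file "jsonl"
jsonl_line = json.dumps(record)

# Cell 3: prove that the reference solution passes its own tests
namespace = {}
exec(record["prompt"] + record["canonical_solution"] + record["test"], namespace)
namespace["check"](namespace[record["entry_point"]])
print("reference solution passes its tests")
```

Splitting at the end of the docstring means a model would see the signature and docstring as its prompt, while the function body stays hidden as the reference solution.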
This is how it works under the hood: the test cases from the notebooks are collected in a jsonl file.
You can compare the original HumanEval code with ours to see the modifications here. The modifications include adding our test cases and jsonl files. Furthermore, on a technical level, we made these modifications to the HumanEval evaluation framework:
Fix "can't pickle" bug. Here we took code provided as a pull request to the original HumanEval repository, which was not merged by the maintainers but seemed reasonable.
Fix Windows-related signal issue. This modification was necessary to make the evaluation run on Windows. See also the discussion in this GitHub issue.
We disabled reliability_guard because it broke all tests. In contrast to HumanEval, our test cases involve complex Python libraries which make system calls in order to process data. Disabling these calls made our tests fail.
We added some code to copy example data to the temporary folder. This enables us to run tests where the file system is used, e.g. to solve tasks such as "list all image files in a folder". The original HumanEval was not capable of evaluating such questions.
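The last point can be illustrated with a small sketch: example data is copied into a temporary folder, and the code under test runs with that folder as its working directory. This is not the code used in this repository; the function names, the example file names, and the `list_image_files` task are hypothetical, and the sketch provides no sandboxing.

```python
import os
import shutil
import tempfile
from pathlib import Path


def run_in_temp_copy(example_data_dir, func):
    """Copy example data to a temporary folder and call func() from there."""
    previous_cwd = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        work_dir = Path(tmp) / "data"
        shutil.copytree(example_data_dir, work_dir)
        os.chdir(work_dir)
        try:
            return func()
        finally:
            # Leave the folder again so that it can be cleaned up afterwards
            os.chdir(previous_cwd)


def list_image_files(folder="."):
    """Hypothetical task: list all .tif and .png files in a folder."""
    return sorted(p.name for p in Path(folder).iterdir()
                  if p.suffix.lower() in {".tif", ".png"})


# Build a tiny stand-in for the example data (empty files, hypothetical names)
with tempfile.TemporaryDirectory() as source:
    for name in ["blobs.tif", "cells.png", "notes.txt"]:
        (Path(source) / name).touch()
    result = run_in_temp_copy(source, list_image_files)

print(result)
```

Working on a copy keeps the original example data untouched even if generated code writes or deletes files.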
To cite our work, e.g. if you are using the Bio-image Analysis test-case set, please cite the following:
@article {benchmark_llm_bia,
author = {Robert Haase and Christian Tischer and Jean-Karim H{\'e}rich{\'e} and Nico Scherf},
title = {Benchmarking Large Language Models for Bio-Image Analysis Code Generation},
elocation-id = {2024.04.19.590278},
year = {2024},
doi = {10.1101/2024.04.19.590278},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2024/04/25/2024.04.19.590278},
eprint = {https://www.biorxiv.org/content/early/2024/04/25/2024.04.19.590278.full.pdf},
journal = {bioRxiv}
}
In case you are only using the evaluation code in this repository, consider using and citing HumanEval instead.