
Benchmarking Large Language Models for Bio-Image Analysis Code Generation
MIT License


This is a fork of the HumanEval repository where modifications were made to adapt the evaluation for Benchmarking LLMs in the Bio-image Analysis domain. The original HumanEval repository is an evaluation harness for the HumanEval problem solving dataset described in the paper "Evaluating Large Language Models Trained on Code".

Insights

Using the benchmark in this repository, we compared 15 LLMs with respect to their capability to generate bio-image analysis Python code. To this end, we defined the test cases listed here, which can be used to evaluate the functional correctness of bio-image analysis code. The pass rate shown in this plot expresses the probability that generated code passed the given unit tests:
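The pass rate here is HumanEval's pass@k metric. A minimal sketch of the unbiased estimator from the HumanEval paper, where n is the number of generated samples per task and c the number of those that pass the unit tests:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    the probability that at least one of k samples drawn from
    n generations passes the unit tests, 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failing samples: every draw of k must contain a pass
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

For example, with n = 10 samples of which c = 3 pass, pass@1 evaluates to 0.3.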

Furthermore, we visualize the observed pass-rate per task:

To find out more, please read our preprint.

Feedback is welcome, e.g. as a GitHub issue.

Installation

Make sure to use Python 3.10 or later:

$ mamba create --name heb python=3.10
$ conda activate heb

Check out and install this repository:

$ git clone https://github.com/haesleinhuepf/human-eval-bia.git
$ cd human-eval-bia
$ pip install -e .
$ pip install -r requirements.txt

To run the benchmark for OpenAI-based models, please create an OpenAI API Key as explained in this blog post.
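The official openai Python package reads the key from the OPENAI_API_KEY environment variable. A small, hypothetical helper (not part of this repository) to fail fast before starting a costly benchmark run:

```python
import os

def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Raise early if the given API key environment variable is missing,
    so a long benchmark run does not fail halfway through."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(
            f"{name} is not set; export it before running the benchmark."
        )
    return key
```

The same pattern applies to other providers that take their key from an environment variable, such as BLABLADOR_API_KEY below.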

To run the benchmark for Google Gemini models, you need to create a Google Cloud account here and a project within Google Cloud (for billing) here. You need to store authentication details locally as explained here, which requires installing the Google Cloud CLI. In short: run the installer and, when asked, activate the "Run gcloud init" checkbox, or run gcloud init from the terminal yourself, then restart the terminal window. After installing the Google Cloud CLI, start a terminal and authenticate using:

gcloud auth application-default login

Follow the instructions in the browser. Enter your Project ID (not the name).

To run the benchmark for the models accessible via Helmholtz' blablador service, which is free for German academics, just get an API key as explained on this page and store it in your environment as the BLABLADOR_API_KEY variable.

To run the benchmark for locally running models, install Ollama. We used Ollama 0.1.29 for Windows (preview).

Usage

This program exists to run untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. The execution call in execution.py is deliberately commented out to ensure users read this disclaimer before running code in a potentially unsafe manner. See the comment in execution.py for more information and instructions.

CAUTION: Also note that when executing the benchmark using commercial models such as ChatGPT or Gemini, substantial costs can be incurred.

To reproduce our benchmarks, you can go through the notebooks provided in the /notebooks directory:

Extending the benchmark

You can add new test cases by adding new notebooks to the /notebooks/human-eval-bia directory. Check out the examples there and make sure to stick to the following rules.

CAUTION: Most importantly, when writing new test-case notebooks, do not use language models for code generation; you would otherwise bias the benchmark towards that model. Use human-written code only and/or examples from the documentation of the specific libraries.

The notebooks have to have the following format:
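For illustration, here is a hypothetical, HumanEval-style test case (not one from the repository): a function whose docstring states the task, paired with a check function asserting functional correctness. Real bio-image analysis test cases typically operate on NumPy arrays; plain lists are used here only to keep the sketch self-contained:

```python
def count_objects(label_image):
    """
    Count the labeled objects in a label image, given as a list of
    rows of integers, where 0 is background and each object is
    marked with a unique positive integer label.
    """
    labels = {pixel for row in label_image for pixel in row}
    labels.discard(0)  # background does not count as an object
    return len(labels)

def check(candidate):
    # Three objects (labels 1, 2, 3) on a zero background.
    image = [[0, 1, 1],
             [0, 0, 2],
             [3, 0, 2]]
    assert candidate(image) == 3

check(count_objects)
```

During evaluation, the docstring serves as the prompt for the LLM, and the check function decides whether the generated code passes.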

Adding dependencies

We aim to collect all Python libraries that LLMs are capable of using in the bio-image analysis context in the requirements.txt file. Additionally, for documentation purposes, we want to record in which environment the last evaluation was executed. Thus, the environment.yml file also needs to be updated, in particular when the requirements change. If your new test case requires specific Python libraries to be installed, please add them to requirements.txt and update the environment.yml file using this command:

conda env export > environment.yml 

Submit both files together with your pull request. That way, we can see how the environment changes when merging the pull request.

How it works

This is how it works under the hood:
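At the core, each candidate completion is executed against its unit test in an isolated process with a timeout. One simple way to sketch this pattern uses a subprocess (assumed here for illustration; the actual HumanEval-derived harness in execution.py additionally restricts resources, and untrusted code should never be run this way outside a sandbox):

```python
import subprocess
import sys

def run_candidate(program: str, timeout: float = 5.0) -> str:
    """Run a candidate program (task code plus its unit test) in a
    fresh Python subprocess and report 'passed', 'failed: ...',
    or 'timed out'."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timed out"  # e.g. an accidental infinite loop
    if proc.returncode == 0:
        return "passed"
    # Report the last line of stderr, typically the exception type.
    lines = proc.stderr.strip().splitlines() or ["unknown error"]
    return "failed: " + lines[-1]
```

The per-task results are then aggregated into the pass rates shown in the plots above.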

Our modifications compared to HumanEval

You can compare the original HumanEval code with ours to see the modifications here. The modifications include adding our test cases and jsonl files. Furthermore, on a technical level, we made these modifications to the HumanEval evaluation framework:

Citation

To cite our work, e.g. if you are using the Bio-image Analysis test-case set, please cite the following:

@article {benchmark_llm_bia,
    author = {Robert Haase and Christian Tischer and Jean-Karim H{\'e}rich{\'e} and Nico Scherf},
    title = {Benchmarking Large Language Models for Bio-Image Analysis Code Generation},
    elocation-id = {2024.04.19.590278},
    year = {2024},
    doi = {10.1101/2024.04.19.590278},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2024/04/25/2024.04.19.590278},
    eprint = {https://www.biorxiv.org/content/early/2024/04/25/2024.04.19.590278.full.pdf},
    journal = {bioRxiv}
}

In case you are only using the evaluation code in this repository, consider using and citing HumanEval instead.