[x] I hereby confirm that NO LLM-based technology (such as github copilot) was used while writing this benchmark
This PR contains:
[ ] new generator-functions allowing to sample from other LLMs
[x] new samples (sample_....jsonl files)
[ ] new benchmarking results (..._results.jsonl files)
[ ] documentation update
[ ] bug fixes
Related github issue (if relevant): closes #0
Short description:
This adds a test-case for applying Bland-Altman analysis.
Since we cannot verify whether a plot is drawn correctly, the test-case covers the numerical Bland-Altman analysis rather than the Bland-Altman plot itself.
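For context, the numerical part of a Bland-Altman analysis reduces to the mean difference (bias) between two measurement series and the 95% limits of agreement. A minimal sketch of what such a test-case can verify (the function name and data below are illustrative, not the actual sample code from this PR):

```python
import statistics


def bland_altman(method_a, method_b):
    """Compute Bland-Altman statistics for two paired measurement series.

    Returns the mean difference (bias) and the 95% limits of agreement,
    i.e. bias +/- 1.96 * standard deviation of the pairwise differences.
    """
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


bias, lower, upper = bland_altman([1.0, 2.0, 3.0], [1.1, 1.9, 3.2])
```

These three numbers are exactly what a benchmark can assert on without rendering the plot.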
How do you think this will influence the benchmark results?
This is a relatively easy test-case, so I expect most LLMs to be able to solve it.
Why do you think it makes sense to merge this PR?
Bland-Altman plots are a useful tool for visualizing the differences between two measurement methods. Since we cannot test plotting itself, testing the underlying analysis is the most we can do in this context.