[x] I hereby confirm that NO LLM-based technology (such as github copilot) was used while writing this benchmark
This PR contains:
[ ] new generator-functions allowing to sample from other LLMs
[x] new samples (sample_....jsonl files)
[ ] new benchmarking results (..._results.jsonl files)
[ ] documentation update
[ ] bug fixes
Related github issue (if relevant): closes #0
Short description:
This adds a test-case for applying Bland-Altman analysis.
Since we cannot verify whether a plot is drawn correctly, the test-case covers the numerical Bland-Altman analysis rather than the Bland-Altman plot itself.
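For context, the numerical part of a Bland-Altman analysis reduces to the mean difference (bias) between two measurement series and the 95% limits of agreement. A minimal sketch of what such a test-case can verify (the function name and data below are illustrative, not the actual sample code from this PR):

```python
import statistics


def bland_altman(method_a, method_b):
    """Compute Bland-Altman statistics for two paired measurement series.

    Returns the mean difference (bias) and the 95% limits of agreement,
    i.e. bias +/- 1.96 * standard deviation of the pairwise differences.
    """
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


bias, lower, upper = bland_altman([1.0, 2.0, 3.0], [1.1, 1.9, 3.2])
```

These three numbers are exactly what a benchmark can assert on without rendering the plot.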
How do you think this will influence the benchmark results?
This is a relatively easy test-case, so I expect most LLMs to be able to solve it.
Why do you think it makes sense to merge this PR?
Bland-Altman plots are a useful tool for visualizing the differences between two measurement methods. Since we cannot test plotting itself, testing the underlying analysis is the most we can do in this context.