[ ] I hereby confirm that NO LLM-based technology (such as GitHub Copilot) was used while writing this benchmark
[ ] new generator functions allowing sampling from other LLMs
[ ] new samples (sample_....jsonl files)
[ ] new benchmarking results (..._results.jsonl files)
[x] documentation update
[ ] bug fixes
Related github issue (if relevant): closes #0
Short description:
I renamed the "canonical" solution to "reference".
I modified the data-visualization notebook so that Table 1 from the paper is exported as a PNG with a colorbar. The table is now also sorted: the best model on the left, the worst on the right.
The sample and result jsonl files were only renamed; there are no new samples or results.
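A minimal sketch of the new export logic, assuming a pandas DataFrame of scores (the variable names, score values, and output filename here are hypothetical, not taken from the notebook):

```python
# Hypothetical sketch: sort benchmark scores so the best model is leftmost,
# then export the table as a PNG with a colorbar (as the notebook now does).
import matplotlib
matplotlib.use("Agg")  # headless backend so savefig works without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical scores: rows = tasks, columns = models (pass rates in [0, 1]).
scores = pd.DataFrame(
    {"model_a": [0.9, 0.8], "model_b": [0.4, 0.5], "model_c": [0.7, 0.6]},
    index=["task_1", "task_2"],
)

# Sort columns by mean score, descending: best model on the left.
order = scores.mean().sort_values(ascending=False).index
scores = scores[order]

fig, ax = plt.subplots()
im = ax.imshow(scores.values, cmap="viridis", vmin=0, vmax=1)
ax.set_xticks(range(len(scores.columns)), scores.columns)
ax.set_yticks(range(len(scores.index)), scores.index)
fig.colorbar(im, ax=ax)  # the colorbar added in this PR
fig.savefig("table1.png", dpi=200, bbox_inches="tight")
```

With the hypothetical data above, the column order after sorting is `model_a`, `model_c`, `model_b`.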
How do you think this will influence the benchmark results?
It won't. This PR only improves the visualization of the results.
Why do you think it makes sense to merge this PR?
It improves readability. It also prevents the recurring question "which LLM is canonical?"