[ ] I hereby confirm that NO LLM-based technology (such as GitHub Copilot) was used while writing this benchmark
[ ] new generator functions allowing sampling from other LLMs
[ ] new samples (sample_....jsonl files)
[ ] new benchmarking results (..._results.jsonl files)
[x] documentation update
[ ] bug fixes
Related github issue (if relevant): closes #0
Short description:
I renamed the "canonical" solution to "reference".
I modified the data-visualization notebook so that Table 1 from the paper is exported as a PNG with a colorbar. The table is now also sorted: the best model on the left, the worst on the right.
The sample and result jsonl files were only renamed; there are no new samples or results.
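A minimal sketch of the new export logic, assuming a pandas DataFrame of scores (the variable names, score values, and output filename here are hypothetical, not taken from the notebook):

```python
# Hypothetical sketch: sort benchmark scores so the best model is leftmost,
# then export the table as a PNG with a colorbar (as the notebook now does).
import matplotlib
matplotlib.use("Agg")  # headless backend so savefig works without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical scores: rows = tasks, columns = models (pass rates in [0, 1]).
scores = pd.DataFrame(
    {"model_a": [0.9, 0.8], "model_b": [0.4, 0.5], "model_c": [0.7, 0.6]},
    index=["task_1", "task_2"],
)

# Sort columns by mean score, descending: best model on the left.
order = scores.mean().sort_values(ascending=False).index
scores = scores[order]

fig, ax = plt.subplots()
im = ax.imshow(scores.values, cmap="viridis", vmin=0, vmax=1)
ax.set_xticks(range(len(scores.columns)), scores.columns)
ax.set_yticks(range(len(scores.index)), scores.index)
fig.colorbar(im, ax=ax)  # the colorbar added in this PR
fig.savefig("table1.png", dpi=200, bbox_inches="tight")
```

With the hypothetical data above, the column order after sorting is `model_a`, `model_c`, `model_b`.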
How do you think this will influence the benchmark results?
It won't. This PR only improves the visualization of the results.
Why do you think it makes sense to merge this PR?
It improves readability. It also prevents the recurring question "which LLM is canonical?"