This PR contains:

- [ ] I hereby confirm that NO LLM-based technology (such as GitHub Copilot) was used while writing this benchmark
- [ ] new dependencies in requirements.txt
- [ ] an updated environment.yml, regenerated via `conda env export > environment.yml`
- [x] new generator functions that allow sampling from other LLMs (see the sketch below this checklist)
- [x] new samples (sample_....jsonl files)
- [x] new benchmarking results (..._results.jsonl files)
- [ ] documentation updates
- [ ] bug fixes
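For context on the generator-function item, here is a minimal sketch of what such a sampling generator might look like, assuming the OpenAI Python client; the function name `sample_gpt4o` and its parameters are illustrative, not the exact code added in this PR:

```python
# Hypothetical sketch of a sampling generator function; names and
# parameters are illustrative, not the exact code in this PR.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def sample_gpt4o(prompt: str, n: int = 1, temperature: float = 0.8):
    """Yield n completions for a single benchmark prompt."""
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        yield response.choices[0].message.content
```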
Related GitHub issue (if relevant): closes #0
Short description:
I sampled and evaluated gpt-4o, which was released this week.
I also tried sampling Gemini 1.5, but the runs failed due to rate limits; a retry sketch that might work around this is shown below.
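One possible workaround is a generic retry wrapper with jittered exponential backoff. This is a sketch, not code from this PR; the exception type to catch depends on the provider SDK (e.g. its rate-limit error class), which is why it catches broadly here:

```python
# Generic retry-with-exponential-backoff wrapper; catching Exception is a
# placeholder for the provider SDK's rate-limit error class.
import random
import time

def sample_with_backoff(sample_fn, prompt: str, max_retries: int = 6):
    """Call sample_fn(prompt), retrying with jittered exponential backoff."""
    for attempt in range(max_retries):
        try:
            return sample_fn(prompt)
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt + random.random())
```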
How do you think this will influence the benchmark results?
gpt-4o is the new leader on our benchmark (pass@1 = 0.51 ± 0.41), slightly ahead of the former leader gpt-4-turbo (pass@1 = 0.47 ± 0.38).
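For reference, a minimal sketch of how such a pass@1 mean ± std could be computed from a results file, assuming each line of the ..._results.jsonl file holds one task with a boolean `passed` field (an assumption about the schema, which isn't shown here):

```python
# Assumed result format: one JSON object per line with a boolean "passed"
# field; the actual schema of the *_results.jsonl files may differ.
import json
import numpy as np

def pass_at_1(results_path: str) -> tuple[float, float]:
    """Return mean and standard deviation of per-task pass rates."""
    scores = []
    with open(results_path) as f:
        for line in f:
            scores.append(1.0 if json.loads(line)["passed"] else 0.0)
    arr = np.array(scores)
    return float(arr.mean()), float(arr.std())
```

Under this reading, the ± value is the standard deviation over per-task pass rates, not a confidence interval.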
Why do you think it makes sense to merge this PR?
We should include new models as they are released whenever there is substantial community interest in benchmarking them. Since gpt-4o is heavily hyped and promised to be better than gpt-4-turbo, it makes sense to include it in our benchmark.
This should not be merged yet, as the paper text has not been updated accordingly.