[ ] I hereby confirm that NO LLM-based technology (such as GitHub Copilot) was used while writing this benchmark
[ ] new dependencies in requirements.txt
[ ] The environment.yml file was updated using the command conda env export > environment.yml
[x] new generator functions that allow sampling from other LLMs
[x] new samples (sample_....jsonl files)
[x] new benchmarking results (..._results.jsonl files)
[x] documentation update
[ ] bug fixes
Short description:
This adds a new LLM to the benchmark: claude-3.5-sonnet
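
For context, the new generator function might look roughly like the sketch below. The names, signatures, and file convention here are illustrative assumptions, not the repository's actual API; the real implementation would wrap the model provider's SDK in `completion_fn`:

```python
import json

def generate_samples(prompts, completion_fn, model="claude-3.5-sonnet"):
    """Yield one JSONL-ready record per prompt.

    completion_fn wraps the actual API call, so samples can be
    generated (or tested offline) with any backend.
    """
    for prompt in prompts:
        yield {"model": model, "prompt": prompt, "completion": completion_fn(prompt)}

def write_samples(path, records):
    # One JSON object per line, matching the sample_*.jsonl convention.
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```

Injecting `completion_fn` keeps the sampling loop independent of any one provider, which is what lets the benchmark add new LLMs without touching the rest of the pipeline.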
How do you think this will influence the benchmark results?
claude-3.5-sonnet is the new leading model in our benchmark.
Results from other LLMs are unchanged.
Why do you think it makes sense to merge this PR?
There was hype around this new model on social media, with claims that it outperforms gpt-4o in many benchmarks. We can confirm this claim; hence, we should include it in our list.
Before merging this, though, we need to update the paper text.