[ ] I hereby confirm that NO LLM-based technology (such as GitHub Copilot) was used while writing this benchmark
[ ] new dependencies in requirements.txt
[ ] The environment.yml file was updated using the command conda env export > environment.yml
[x] new generator functions that allow sampling from other LLMs
[x] new samples (sample_....jsonl files)
[x] new benchmarking results (..._results.jsonl files)
[x] documentation update
[ ] bug fixes
Short description:
This adds a new LLM to the benchmark: claude-3.5-sonnet
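
For context, the new generator function might look roughly like the sketch below. The names, signatures, and file convention here are illustrative assumptions, not the repository's actual API; the real implementation would wrap the model provider's SDK in `completion_fn`:

```python
import json

def generate_samples(prompts, completion_fn, model="claude-3.5-sonnet"):
    """Yield one JSONL-ready record per prompt.

    completion_fn wraps the actual API call, so samples can be
    generated (or tested offline) with any backend.
    """
    for prompt in prompts:
        yield {"model": model, "prompt": prompt, "completion": completion_fn(prompt)}

def write_samples(path, records):
    # One JSON object per line, matching the sample_*.jsonl convention.
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```

Injecting `completion_fn` keeps the sampling loop independent of any one provider, which is what lets the benchmark add new LLMs without touching the rest of the pipeline.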
How do you think this will influence the benchmark results?
claude-3.5-sonnet is the new leading model in our benchmark.
Results from other LLMs are unchanged.
Why do you think it makes sense to merge this PR?
There was hype around this new model on social media, with claims that it outperforms gpt-4o in many benchmarks. We can confirm this claim; hence, we should include it in our list.
Before merging this, though, we need to update the paper text.