This PR contains:

- [ ] I hereby confirm that NO LLM-based technology (such as GitHub Copilot) was used while writing this benchmark
- [ ] new dependencies in requirements.txt
- [ ] an updated environment.yml, regenerated via `conda env export > environment.yml`
- [x] new generator functions that allow sampling from other LLMs (see the sketch below this checklist)
- [x] new samples (sample_....jsonl files)
- [x] new benchmarking results (..._results.jsonl files)
- [ ] documentation updates
- [ ] bug fixes
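For context on the generator-function item, here is a minimal sketch of what such a sampling generator might look like, assuming the OpenAI Python client; the function name `sample_gpt4o` and its parameters are illustrative, not the exact code added in this PR:

```python
# Hypothetical sketch of a sampling generator function; names and
# parameters are illustrative, not the exact code in this PR.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def sample_gpt4o(prompt: str, n: int = 1, temperature: float = 0.8):
    """Yield n completions for a single benchmark prompt."""
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        yield response.choices[0].message.content
```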
Related GitHub issue (if relevant): closes #0
Short description:
I sampled and evaluated gpt-4o, which was released this week.
I also tried sampling Gemini 1.5, but the runs failed due to rate limits; a retry sketch that might work around this is shown below.
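One possible workaround is a generic retry wrapper with jittered exponential backoff. This is a sketch, not code from this PR; the exception type to catch depends on the provider SDK (e.g. its rate-limit error class), which is why it catches broadly here:

```python
# Generic retry-with-exponential-backoff wrapper; catching Exception is a
# placeholder for the provider SDK's rate-limit error class.
import random
import time

def sample_with_backoff(sample_fn, prompt: str, max_retries: int = 6):
    """Call sample_fn(prompt), retrying with jittered exponential backoff."""
    for attempt in range(max_retries):
        try:
            return sample_fn(prompt)
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt + random.random())
```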
How do you think this will influence the benchmark results?
gpt-4o is the new leader on our benchmark (pass@1 = 0.51 ± 0.41), slightly ahead of the former leader gpt-4-turbo (pass@1 = 0.47 ± 0.38).
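For reference, a minimal sketch of how such a pass@1 mean ± std could be computed from a results file, assuming each line of the ..._results.jsonl file holds one task with a boolean `passed` field (an assumption about the schema, which isn't shown here):

```python
# Assumed result format: one JSON object per line with a boolean "passed"
# field; the actual schema of the *_results.jsonl files may differ.
import json
import numpy as np

def pass_at_1(results_path: str) -> tuple[float, float]:
    """Return mean and standard deviation of per-task pass rates."""
    scores = []
    with open(results_path) as f:
        for line in f:
            scores.append(1.0 if json.loads(line)["passed"] else 0.0)
    arr = np.array(scores)
    return float(arr.mean()), float(arr.std())
```

Under this reading, the ± value is the standard deviation over per-task pass rates, not a confidence interval.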
Why do you think it makes sense to merge this PR?
We should include new models as they are released whenever there is substantial community interest in benchmarking them. Since gpt-4o is heavily hyped and promised to be better than gpt-4-turbo, it makes sense to include it in our benchmark.
This should not be merged yet, as the paper text has not been updated accordingly.