Samples from recent open-source models.

This PR contains:

- [ ] a new test-case for the benchmark
  - [ ] I hereby confirm that NO LLM-based technology (such as github copilot) was used while writing this benchmark
- [ ] new dependencies in requirements.txt
  - [ ] The environment.yml file was updated using the command conda env export > environment.yml
- [x] new generator-functions allowing to sample from other LLMs
- [x] new samples (sample_....jsonl files)
- [ ] new benchmarking results (..._results.jsonl files)
- [ ] documentation update
- [ ] bug fixes

Related github issue (if relevant): closes #0

Short description:

How do you think this will influence the benchmark results?

Why do you think it makes sense to merge this PR?
