haesleinhuepf / human-eval-bia

Benchmarking Large Language Models for Bio-Image Analysis Code Generation
MIT License

add notebook to summarize common failure reasons #51

Closed haesleinhuepf closed 5 months ago

haesleinhuepf commented 5 months ago

This PR contains:

Related github issue (if relevant): related to #17

Short description:

How do you think this will influence the benchmark results?

Why do you think it makes sense to merge this PR?

@tischi This might be interesting for you, since you were aiming in that direction, as discussed in #17.