[ ] I hereby confirm that NO LLM-based technology (such as github copilot) was used while writing this benchmark
This PR contains:
[ ] new generator-functions that allow sampling from other LLMs
[ ] new samples (sample_....jsonl files)
[ ] new benchmarking results (..._results.jsonl files)
[x] documentation update
[ ] bug fixes
Related GitHub issue (if relevant): closes #0
Short description:
This adds a way to estimate the number of unit tests per test case: we count the actual assert statements that exercise the functions under test.
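For illustration, the count could be done with Python's `ast` module; this is a minimal sketch of the idea, and the helper name `count_asserts` is my own here, not necessarily how this PR implements it:

```python
import ast

def count_asserts(test_source: str) -> int:
    """Count assert statements in a test case's source (hypothetical helper)."""
    tree = ast.parse(test_source)
    # ast.walk visits every node, so asserts nested in loops/ifs are counted too
    return sum(isinstance(node, ast.Assert) for node in ast.walk(tree))

example = (
    "def check(candidate):\n"
    "    assert candidate(1) == 2\n"
    "    assert candidate(2) == 4\n"
)
print(count_asserts(example))  # -> 2
```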
How do you think this will influence the benchmark results?
It adds another metric, one that describes the quantity of tests per test case.
Why do you think it makes sense to merge this PR?
I'm not sure we need it. The original HumanEval paper reports approx. 7.7 unit tests per test case, while we count approx. 2.5 assert statements per test case; I'm not sure these numbers are directly comparable.