[x] I hereby confirm that NO LLM-based technology (such as github copilot) was used while writing this benchmark
[ ] new generator-functions allowing to sample from other LLMs
[ ] new samples (sample_....jsonl files)
[ ] new benchmarking results (..._results.jsonl files)
[ ] documentation update
[ ] bug fixes
Related github issue (if relevant): closes #23
Short description:
This adds a new test case for processing images in tiles. I'm using dask for this (a new dependency for this project), but it could certainly be done without it.
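For reviewers unfamiliar with the pattern: tiled processing means splitting a large image into fixed-size blocks, applying an operation per block, and reassembling the result. A minimal, dependency-free sketch of the idea (function and parameter names here are illustrative assumptions, not the actual sample code; the PR itself uses dask's chunked arrays for the same effect):

```python
import numpy as np

def process_in_tiles(image, tile_size=(256, 256), func=lambda t: t):
    """Apply `func` to each tile of a 2D image and reassemble.

    Hypothetical sketch only -- the new test case in this PR uses dask
    instead of an explicit loop, but the tiling logic is equivalent.
    """
    out = np.empty_like(image)
    h, w = image.shape
    th, tw = tile_size
    for y in range(0, h, th):
        for x in range(0, w, tw):
            # Slicing clips automatically at the image border,
            # so edge tiles may be smaller than tile_size.
            tile = image[y:y + th, x:x + tw]
            out[y:y + th, x:x + tw] = func(tile)
    return out

image = np.arange(16, dtype=np.uint8).reshape(4, 4)
result = process_in_tiles(image, tile_size=(2, 2), func=lambda t: t * 2)
```

With dask, the explicit double loop is replaced by `da.from_array(image, chunks=tile_size)` followed by `map_blocks(func)`, which additionally parallelizes the per-tile work.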
How do you think this will influence the benchmark results?
I have not tested this, but I presume this test case is a hard one; it may be that no current LLM can solve it, which would decrease pass rates for all LLMs.
Why do you think it makes sense to merge this PR?
Tiled image processing is a common task in bio-image analysis. It makes sense to include this in our benchmark.