haesleinhuepf / human-eval-bia

Benchmarking Large Language Models for Bio-Image Analysis Code Generation
MIT License

Sampled and evaluated gpt-4o, reran plotting notebooks #66

Closed: haesleinhuepf closed this 1 day ago

haesleinhuepf commented 1 month ago

This PR contains:

Related GitHub issue (if relevant): closes #0

Short description: sampled and evaluated gpt-4o, reran the plotting notebooks.

How do you think this will influence the benchmark results?

(attached image: benchmark results plot)

Why do you think it makes sense to merge this PR?

This should not be merged yet, as the paper text has not been adapted.
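
For readers following along: "sampling" a model for this benchmark means querying it once (or several times) per test case and storing the generated code for later evaluation. A minimal sketch of what that step can look like, assuming the OpenAI Python client (>= 1.0); the prompt text and the helper name `sample_code` are illustrative assumptions, not the repository's actual code:

```python
# Minimal sketch: draw n candidate solutions from gpt-4o for one benchmark task.
# Assumes the OpenAI Python client (openai >= 1.0) and an OPENAI_API_KEY
# environment variable; helper name and prompt are hypothetical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def sample_code(task_prompt: str, n: int = 10) -> list[str]:
    """Return n generated candidate solutions for a single task prompt."""
    response = client.chat.completions.create(
        model="gpt-4o",
        n=n,
        messages=[{"role": "user", "content": task_prompt}],
    )
    return [choice.message.content for choice in response.choices]

samples = sample_code("Write a Python function that labels nuclei in a 2D image.")
```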

haesleinhuepf commented 1 month ago

Adding a note on costs: benchmarking gpt-4o cost $2.73.

(attached image: API usage cost screenshot)
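
For reference, API cost can be estimated from the token counts the API reports back. A minimal sketch, assuming list prices of $5 per 1M input tokens and $15 per 1M output tokens for gpt-4o (an assumption; verify current rates), and with purely hypothetical token counts:

```python
# Rough cost estimate from token usage. The per-token prices below are
# assumed example rates for gpt-4o ($5 / 1M input, $15 / 1M output);
# check current pricing before relying on them.
def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * 5 / 1e6 + output_tokens * 15 / 1e6

# Hypothetical example: ~300k input and ~80k output tokens come to about $2.70.
print(f"${estimate_cost(300_000, 80_000):.2f}")
```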