haesleinhuepf / human-eval-bia

Benchmarking Large Language Models for Bio-Image Analysis Code Generation
MIT License

New test case suggestions #141

Open rmassei opened 1 week ago

rmassei commented 1 week ago

Hi! I was wondering if the following test cases might be of interest:

tischi commented 1 week ago

+1 for bioio

rmassei commented 1 week ago

ome-types is another option, since I often find myself using it to compile ome.xml files with standardized metadata.
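
For illustration, a minimal sketch (not from the PR; it assumes a recent ome-types release, and the image name, sizes and pixel type are made-up placeholder values) of compiling a small ome.xml from code:

```python
# Hypothetical example: build a minimal OME metadata object and serialize it
# to OME-XML. All field values below are placeholders.
from ome_types import OME, to_xml
from ome_types.model import Image, Pixels

pixels = Pixels(
    dimension_order="XYZCT",
    size_x=512, size_y=512, size_z=1, size_c=1, size_t=1,
    type="uint16",
    metadata_only=True,  # metadata description only, no pixel data attached
)
ome = OME(images=[Image(name="example", pixels=pixels)])

print(to_xml(ome))  # standardized ome.xml as a string
```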

rmassei commented 1 week ago

PR in https://github.com/haesleinhuepf/human-eval-bia/pull/142

haesleinhuepf commented 1 week ago

Love it, thanks for the proposals @rmassei! Regarding mahotas and omero: I'd say these test-cases would be interesting if you formulate the prompts in a way that mahotas / omero are not mentioned as libraries. So far, our test-cases typically formulate a problem and do not ask for specific libraries (with the exception of numpy). In my opinion (others may see this differently), we should not have test-cases like "Register the images using ITK", but rather "register the images", which gives the LLM the freedom to also use libraries we did not think of. You can see in this figure of the current paper that the LLMs chose openCV (cv2) in many cases, which we had not anticipated.

Hence, do you know of any use-cases that can be solved with mahotas which are not yet on our list?
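
For illustration, a hypothetical test-case stub in the style described above (the function name and signature are made up; the docstring states the problem without prescribing a library, and the body is what the LLM is asked to generate):

```python
def register_images(fixed_image, moving_image):
    """
    Registers moving_image onto fixed_image and returns the transformed
    moving_image as a numpy array. No specific library is mentioned, so
    the LLM is free to use ITK, openCV, scikit-image, or anything else.
    """
    # Body intentionally left unimplemented; generating it is the task.
    raise NotImplementedError
```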

rmassei commented 1 week ago

Thanks for the explanation @haesleinhuepf! I did not add mahotas cases in the PR since, after looking at the existing cases, inspiration did not knock at my door and I could not find anything new to add, at least as a snippet. I will check the documentation and see whether there are some tasks that would be interesting to solve with mahotas, but I will then probably open another PR.

haesleinhuepf commented 1 week ago

I dived into mahotas earlier and found the seeded watershed implementation nice; it is much easier to use than the one from scikit-image. Maybe that's an inspiration, and if not, that's ok too. With the use-cases you sent, you will already be the 2nd or 3rd most active contributor ;-)
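
For reference, a minimal sketch (not a proposed test case) of a seeded watershed with mahotas, assuming a grayscale numpy image with bright blobs on a dark background:

```python
import mahotas as mh

def seeded_watershed(image, sigma=8.0):
    # smooth the image so that each blob yields roughly one regional maximum
    smoothed = mh.gaussian_filter(image.astype(float), sigma)
    # label the regional maxima; these labelled spots serve as seeds
    seeds, n_seeds = mh.label(mh.regmax(smoothed))
    # invert intensities (as uint8) so bright blobs become basins to flood
    surface = 255 - mh.stretch(smoothed)
    # flood the surface from the seeds and return the label image
    return mh.cwatershed(surface, seeds)
```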

tischi commented 1 week ago

I generally agree on not asking about solutions implemented in specific packages.

However, don't we implicitly only allow for specific packages because our testing environment does not have everything? I forgot: Are the available libraries something that we communicate to the LLM for writing the code?

haesleinhuepf commented 1 week ago

The readme explains how to deal with missing dependencies (we add them to the requirements.txt) and has a link to a notebook for detecting missing stuff: https://github.com/haesleinhuepf/human-eval-bia/blob/main/demo/detect_missing_requirements.ipynb
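
For readers without the notebook at hand, the underlying idea can be sketched roughly as follows (this is not the linked notebook, just a hypothetical illustration): parse generated code for import statements and report modules that cannot be resolved in the current environment.

```python
# Hypothetical sketch: find top-level modules imported by a code snippet
# that are not installed in the current environment.
import ast
import importlib.util

def missing_imports(source):
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module.split(".")[0])
    # a module counts as missing if importlib cannot locate it
    return {m for m in modules if importlib.util.find_spec(m) is None}

print(missing_imports("import mahotas\nfrom bioio import BioImage"))
```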