haesleinhuepf / human-eval-bia

Benchmarking Large Language Models for Bio-Image Analysis Code Generation

How to deal with tests that fail due to missing dependencies #39

tischi opened this issue 5 months ago (Open)

tischi commented 5 months ago

How do we deal with tests that fail due to missing dependencies?

Example: the LLM decides to use the circle-fit package to fit a circle, but we don't have that in our environment?

I think it would be unfair to count this as a fail. For instance, an LLM that tends to use libraries we do not provide would look poor in the benchmark, even though it may actually be very good.

I think a solution could be to tell the LLM, when it generates the solution, which libraries are available and to instruct it to use only those. Maybe that is already the case?
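For illustration only, a minimal sketch of what that could look like, assuming the prompt is built from the package names in requirements.txt (the helper name is made up, not the benchmark's actual code):

```python
# Rough sketch (not the benchmark's actual code): read the package names from
# requirements.txt and list them in the prompt so the LLM only uses those.
from pathlib import Path

def build_system_prompt(requirements_file="requirements.txt"):
    packages = []
    for line in Path(requirements_file).read_text().splitlines():
        line = line.split("#")[0].strip()  # drop comments and blank lines
        if line:
            # strip version specifiers such as ==1.2 or >=1.0
            packages.append(line.split("==")[0].split(">=")[0].strip())
    return (
        "Write Python code to solve the task. "
        "Only use the following installed libraries: "
        + ", ".join(packages)
        + ". Do not import any other third-party packages."
    )
```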

haesleinhuepf commented 5 months ago

How about searching the _result.json files for "ImportError" to get an idea of which packages are missing, and potentially adding them to requirements.txt before running the evaluation a second time?
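As an illustration, a quick script along these lines could pull the missing package names out of the result files (the file pattern and the assumption that the error text is stored verbatim in the JSON are assumptions, not how the repo necessarily stores results):

```python
# Sketch only: scan the result files for ModuleNotFoundError messages and
# collect the package names mentioned in them.
import re
from pathlib import Path

missing = set()
for path in Path(".").rglob("*_result.json"):
    text = path.read_text(errors="ignore")
    # ModuleNotFoundError messages look like: No module named 'circle_fit'
    missing.update(re.findall(r"No module named '([^']+)'", text))

print("Potentially missing packages:", sorted(missing))
```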

tischi commented 5 months ago

Yes, maybe, but I don't know if that would be an efficient long-term solution, because every new network and every new test might result in other dependencies?

haesleinhuepf commented 5 months ago

every new test might result in other dependencies?

I think in practice it's not that much work. I wrote this up as a notebook in #42; it currently shows which dependencies could be added.

And everyone who wants to add tests for specific libraries has to add them to the requirements, as outlined in #41.

What do you think?

tischi commented 5 months ago

to the requirements as outlined in ...

Could you explain the relation between requirements.txt and environment.yml? So far I have only edited requirements.txt to add new dependencies...


In general, it's great that you added this notebook to detect potentially missing dependencies. As mentioned in your PR, it seems a bit tricky to automatically detect whether (a) we don't provide the dependency in our environment, or (b) the LLM made a mistake (e.g. skimage.label vs. skimage.measure.label).
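One heuristic that might help to separate the two cases, purely as a sketch and not the repo's actual implementation: a ModuleNotFoundError whose top-level package is absent from the environment points to (a), while other import errors point to (b).

```python
# Heuristic sketch (an assumption, not the repo's implementation): classify
# an import-related exception as a missing dependency or an LLM mistake.
import importlib.util

def classify_failure(exc):
    if isinstance(exc, ModuleNotFoundError):
        top_level = exc.name.split(".")[0] if exc.name else None
        if top_level and importlib.util.find_spec(top_level) is None:
            return "missing_dependency"  # case (a): package not in our environment
        return "llm_mistake"             # package exists, but the sub-module path is wrong
    if isinstance(exc, (ImportError, AttributeError)):
        return "llm_mistake"             # e.g. skimage.label vs. skimage.measure.label
    return "other_error"

try:
    from skimage import label           # wrong import path for skimage.measure.label
except Exception as e:
    # prints "llm_mistake" if scikit-image is installed, "missing_dependency" otherwise
    print(classify_failure(e))
```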

Maybe we could ask the LLM not only to implement the function but also to provide the corresponding conda or pip installation instructions. But then executing these installations every time is probably too expensive in terms of time, energy, and CO2.

Maybe the following could be an idea: we ask the LLM for the installation instructions and then automatically check whether all dependencies suggested by the model are available in our environment. If so, we just run the test without installing anything. If a dependency is missing, we flag it and count the test in a special way, maybe as NA. A minimal sketch of such a check follows below.
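For what it's worth, such a check could look roughly like this (the helper names and the "NA" return value are assumptions for illustration):

```python
# Sketch with assumed helper names: decide whether to run a test or score it
# as "NA", based on the pip packages the LLM says it needs.
from importlib.metadata import distributions

def installed_packages():
    # normalise names the way pip does (case-insensitive, '-' equivalent to '_')
    return {d.metadata["Name"].lower().replace("_", "-")
            for d in distributions() if d.metadata["Name"]}

def test_verdict(suggested_packages):
    wanted = {p.lower().replace("_", "-") for p in suggested_packages}
    missing = wanted - installed_packages()
    if missing:
        return "NA", sorted(missing)  # don't count as pass or fail
    return "run", []

print(test_verdict(["numpy", "circle-fit"]))
# e.g. ('NA', ['circle-fit']) if circle-fit is not installed
```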

Maybe we should write something about this issue in the discussion section of the paper? Then perhaps someone (e.g. a reviewer) will suggest something.

haesleinhuepf commented 5 months ago

Maybe we should write something about this issue in the discussion section of the paper?

Definitely, yes.

And as shown in #42, this is hardly an issue. We're talking about 3 libraries that were missing across 2800 generated code samples.

Could you explain me the relation of the requirements.txt and the environment.yml ?

I tried to explain that in #41. The goal is to have a list of requirements that should be installed (requirements.txt) and a complete list of all dependencies that are actually installed in the environment (environment.yml).

haesleinhuepf commented 5 months ago

I added an explanatory sentence to #42 and to the Google Doc.