Open tischi opened 5 months ago
How about searching the _result.json files for "ImportError" to get an idea of which packages are missing, and potentially adding them to requirements.txt before running the evaluation a second time?
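A minimal sketch of that search (the file layout and the exact error text are assumptions on my side):

```python
import json
import pathlib
import re

def find_missing_modules(results_dir):
    """Scan *_result.json files for ImportError / ModuleNotFoundError
    messages and collect the module names they mention."""
    missing = set()
    # Python's standard error message for a missing top-level package
    pattern = re.compile(r"No module named '([^']+)'")
    for path in pathlib.Path(results_dir).glob("*_result.json"):
        text = path.read_text()
        if "ImportError" in text or "ModuleNotFoundError" in text:
            missing.update(pattern.findall(text))
    return sorted(missing)
```

The result could then be compared against requirements.txt by hand before re-running the evaluation.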
Yes, maybe, but I don't know whether that would be an efficient long-term solution, because every new network and every new test might result in other dependencies?
every new test might result in other dependencies?
I think in practice it's not that much work. I wrote this as a notebook in #42 . Currently, I see that these dependencies could be added:
And everyone who wants to add tests for specific libraries has to add them to the requirements as outlined in #41
What do you think?
to the requirements as outlined in ...
Could you explain to me the relation between the requirements.txt and the environment.yml? So far I have only edited the requirements.txt to add new dependencies...
In general, it's great that you added this notebook to detect potentially failing dependencies. As mentioned in your PR, it seems a bit tricky to automatically detect whether (a) we don't provide the dependency in our environment, or (b) the LLM made a mistake (e.g. skimage.label vs. skimage.measure.label).
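To tell the two cases apart automatically, a rough heuristic could look at the exception type (just a sketch, not project code): a ModuleNotFoundError points at a package missing from our environment, while an ImportError or AttributeError on an installed package usually points at a model mistake.

```python
def classify_failure(exc):
    """Heuristically attribute a test failure (assumption: we catch the
    exception raised when executing the generated code)."""
    if isinstance(exc, ModuleNotFoundError):
        # e.g. circle-fit is simply not installed in our environment
        return "missing-dependency"
    if isinstance(exc, (ImportError, AttributeError)):
        # e.g. `skimage.label` instead of `skimage.measure.label`
        return "likely-llm-mistake"
    return "other"
```

ModuleNotFoundError is a subclass of ImportError, so it must be checked first.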
Maybe we could ask the LLM not only to implement the function but also to provide the corresponding conda or pip installation instructions. But executing these installations every time is probably too expensive in terms of time, energy, and CO2 consumption.
Maybe the following could be an idea: we ask the LLM for the installation instructions and then automatically check whether all the dependencies suggested by the model are available in our environment; if so, we just run the test without installing anything. If a dependency is missing, however, we could flag this somewhere and count such a test in a special way, maybe as NA.
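That availability check could be sketched with importlib (one caveat: pip package names such as circle-fit differ from import names such as circle_fit, so the mapping from installation instructions to import names is an assumption here):

```python
import importlib.util

def check_dependencies(suggested_modules):
    """Split a list of import names suggested by the model into those
    available in the current environment and those missing; a missing
    one would mark the test as NA instead of failed."""
    missing = [m for m in suggested_modules
               if importlib.util.find_spec(m) is None]
    available = [m for m in suggested_modules if m not in missing]
    return available, missing
```

This only touches the import machinery, so it is cheap enough to run for every generated solution.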
Maybe we should write something about this issue in the discussion section of the paper? Then maybe someone (e.g. a reviewer) might suggest something.
Maybe we should write something about this issue in the discussion section of the paper?
Definitely, yes.
And as shown in #42 this is hardly an issue. We're talking about 3 libraries that were missing across 2800 generated code samples.
Could you explain to me the relation between the requirements.txt and the environment.yml?
I tried that in #41 . The goal is to have a list of requirements that should be installed (requirements.txt) and a complete list of all dependencies that are actually installed in the environment (environment.yml).
I added an explanatory sentence to #42 and to the Google Doc.
How do we deal with tests that fail due to missing dependencies?
Example: the LLM decides to use the circle-fit package to fit a circle, but we don't have that in our environment.
I think it would be unfair to count this as a fail. For instance, if one LLM tends to use libraries that we do not provide, we would conclude that it is not very good, even though it may be excellent.
I think a solution could be to let the LLM know, upon creation of the solution, which libraries are available, and instruct it to only use those. Maybe this is already the case anyway?
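If it isn't, a hypothetical prompt template (the function and parameter names are my assumptions, not project code) could look like this:

```python
def build_prompt(task_description, allowed_libraries):
    """Prepend the task with an explicit whitelist of libraries,
    so the model is less likely to import unavailable packages."""
    libs = ", ".join(sorted(allowed_libraries))
    return (
        f"{task_description}\n\n"
        f"Only use the following libraries: {libs}. "
        "Do not import any other packages."
    )
```

The whitelist could be generated directly from requirements.txt, so prompt and environment stay in sync.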