haesleinhuepf / human-eval-bia

Benchmarking Large Language Models for Bio-Image Analysis Code Generation

Evaluating LLM capabilities relative to task complexity #77

ian-coccimiglio opened this issue 3 months ago (status: Open)

ian-coccimiglio commented 3 months ago

I've been working on evaluating how well LLMs can handle bioimaging tasks relative to the complexity of the task.

First, we can see that different tasks have different probabilities of being solved by any given LLM. On the left, certain tasks (like file opening, t-tests, and making correlation matrices) are solved accurately, whereas more complex tasks (such as deconvolution) are more likely to fail. This is interesting because it gives an indication of the probability of an accurate solution regardless of LLM choice. (figure: Task_Score_by_LLMs)
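For reference, here is a minimal sketch of how such a per-task pass rate could be computed, assuming the results are available as a long-format table with one row per generated sample; the column names and values are toy placeholders, not the repository's actual schema.

```python
import pandas as pd

# Toy long-format results: one row per generated sample, with a boolean
# "passed" column. Real data would come from the benchmark result files.
results = pd.DataFrame({
    "model":  ["model_a", "model_a", "model_b", "model_b"],
    "task":   ["open_image_file", "deconvolve_image"] * 2,
    "passed": [True, False, True, False],
})

# Fraction of passing samples per task, pooled across all models and
# sorted so the easiest tasks come first.
task_pass_rate = (
    results.groupby("task")["passed"]
    .mean()
    .sort_values(ascending=False)
)
print(task_pass_rate)
```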

Diving in deeper, we might expect that the number of libraries required and the amount of code required both loosely correspond to how 'challenging' any given task is. As such, I computed the following two metrics for each task: 1) the number of library imports, and 2) the number of lines of code in the solution (excluding import lines).
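For concreteness, here is a minimal sketch of how these two metrics could be computed from a solution's source code; the helper name and the example solution are made up for illustration.

```python
def complexity_metrics(solution: str) -> tuple[int, int]:
    """Return (number of import statements, lines of code excluding imports)."""
    lines = [line.strip() for line in solution.splitlines() if line.strip()]
    imports = [l for l in lines if l.startswith(("import ", "from "))]
    code = [l for l in lines if l not in imports and not l.startswith("#")]
    return len(imports), len(code)


example_solution = """
import numpy as np
from skimage.filters import gaussian

def smooth(image):
    # apply a Gaussian blur
    return gaussian(np.asarray(image), sigma=2)
"""

print(complexity_metrics(example_solution))  # -> (2, 2)
```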

We can see a negative relationship between task score and both of these quantities. (figure: Average_Task_Score_by_Complexity)

Finally, the most interesting finding is that more modern/larger LLMs are generally able to solve more complex problems. This is indicated by the regression slopes, which are neutral or positive for the newer, more modern LLMs.

Each model compared against task complexity, as measured by the number of imports. (figure: LLMs_Complex_Problems_Imports)

Each model compared against task complexity, as measured by the number of lines of code in the solution. (figure: LLMs_Complex_Problems_Lines)
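A minimal sketch of those per-model fits, using toy numbers; in practice the input would be each model's pass rate per task joined with the complexity metrics sketched above.

```python
import numpy as np
import pandas as pd

# Toy per-(model, task) scores; real values would come from the benchmark
# results combined with the complexity metrics above.
scores = pd.DataFrame({
    "model":     ["model_a"] * 3 + ["model_b"] * 3,
    "n_imports": [1, 2, 4, 1, 2, 4],
    "pass_rate": [0.8, 0.5, 0.2, 0.9, 0.85, 0.8],
})

for model, group in scores.groupby("model"):
    # Slope of a simple linear fit: pass_rate ~ n_imports.
    slope, _ = np.polyfit(group["n_imports"], group["pass_rate"], deg=1)
    print(f"{model}: slope = {slope:+.3f}")

# A clearly negative slope suggests performance drops off as complexity grows;
# a slope near zero (or positive) suggests the model handles the more complex
# tasks about as well as the simple ones.
```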

ian-coccimiglio commented 3 months ago

Something I'd like to do with this is find a way to aggregate tasks by 'type', to provide a guideline for readers to determine which kinds of tasks LLMs are suitable for.

Also note this image.sc topic for other updates: https://forum.image.sc/t/preprint-alert-and-call-for-contributions-llms-for-bio-image-analysis/98719/21

haesleinhuepf commented 3 months ago

Hi @ian-coccimiglio,

These are great suggestions! One thing we could immediately take over from your work: the list of tasks in this notebook / figure could be sorted by average pass rate (as you show in your first figure).

I also like the idea of grouping the tasks. In some not-yet-published work I asked an LLM to group the tasks into categories (segmentation, denoising, tabular data wrangling, complex workflows, etc.). From this experiment I learned two things: 1) individual test cases may appear in multiple categories, and 2) as our number of test cases is quite small, it is hard to draw conclusions from it, at least conclusions that are helpful for an LLM user.
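To make the multi-category point concrete, here is a small sketch of how per-category pass rates could be aggregated when a test case belongs to several categories; the task names, category assignments, and pass rates are all made up for illustration.

```python
import pandas as pd

# Hypothetical category assignments: one task may carry several labels.
task_categories = {
    "segment_nuclei":         ["segmentation"],
    "denoise_and_segment":    ["denoising", "segmentation", "complex workflows"],
    "summarize_measurements": ["tabular data wrangling"],
}

# Hypothetical per-task pass rates (e.g. averaged over all models).
pass_rate = pd.Series({
    "segment_nuclei": 0.7,
    "denoise_and_segment": 0.2,
    "summarize_measurements": 0.9,
})

# One row per (task, category) membership.
membership = pd.DataFrame(
    [{"task": task, "category": category}
     for task, categories in task_categories.items()
     for category in categories]
)

# Mean pass rate per category, with multi-category tasks counted in each.
per_category = (
    membership.assign(pass_rate=membership["task"].map(pass_rate))
    .groupby("category")["pass_rate"]
    .mean()
)
print(per_category)
```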

Do you have any ideas for rephrasing your analysis in some way like: "We observe a higher pass rate in the case of ???. In the opposite case of ???, we recommend reviewing the generated code more carefully."

Also, I often find myself explaining to LLM users: "Ask a simple question. Do not go for complex code at the very beginning." Is this advice our data could support?

Thanks again for working on this! It's amazing to see how others dive into this data :-)

ian-coccimiglio commented 3 months ago

> These are great suggestions! One thing we could immediately take over from your work: the list of tasks in this notebook / figure could be sorted by average pass rate (as you show in your first figure).

Yup, I think this ended up looking great, and it's pretty informative too. I'll send a PR for this part.

(figure: performance_per_task)
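For reference, a minimal sketch of how such a sorted bar chart could be drawn; the task names and pass rates here are toy values standing in for the real per-task averages.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy per-task pass rates; in practice this would be the per-task average
# computed from the benchmark results.
task_pass_rate = pd.Series({
    "open_image_file":    0.95,
    "t_test":             0.90,
    "correlation_matrix": 0.85,
    "deconvolve_image":   0.10,
})

sorted_rates = task_pass_rate.sort_values()
fig, ax = plt.subplots(figsize=(6, 0.4 * len(sorted_rates) + 1))
ax.barh(sorted_rates.index, sorted_rates.values)
ax.set_xlabel("average pass rate")
ax.set_xlim(0, 1)
fig.tight_layout()
fig.savefig("performance_per_task.png")
```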

ian-coccimiglio commented 3 months ago

> Do you have any ideas for rephrasing your analysis in some way like: "We observe a higher pass rate in the case of ???. In the opposite case of ???, we recommend reviewing the generated code more carefully."

> Also, I often find myself explaining to LLM users: "Ask a simple question. Do not go for complex code at the very beginning." Is this advice our data could support?

Yeah, that's where I'd like this analysis to go. I think the analysis does support that advice, but more specifically it means that LLMs have a higher success rate at solving problems with a limited scope. From the analyses above we can see that the problems LLMs struggle with the most are larger ones that require a lot of very specific, correct code and interacting libraries. Each individual piece of such a problem may still be independently solvable, but confirming this would require some more analysis.

jkh1 commented 3 months ago

Related to the complexity of the task, I wonder if the performance could also be linked to whether there exists a single dedicated function that produces the expected result (possibly including code from training data, such as could be obtained from Stack Overflow). This could give a seemingly complex task a higher success rate if all that's needed is to identify the correct function call, as opposed to combining multiple operations or reasoning to find an applicable function. For example, t-test or umap most likely have dedicated functions whose names match the query, whereas the task of summing images doesn't have one but can be achieved through e.g. numpy.sum(), which doesn't mention images in its description.
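A tiny illustration of that distinction, with random toy data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Dedicated function whose name matches the query almost literally:
a = rng.normal(0.0, 1.0, 100)
b = rng.normal(0.5, 1.0, 100)
t_statistic, p_value = stats.ttest_ind(a, b)

# No image-specific function for "sum two images"; the solution is a generic
# array operation that never mentions images in its documentation:
image1 = rng.random((64, 64))
image2 = rng.random((64, 64))
summed = image1 + image2   # or np.add(image1, image2)
```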

ian-coccimiglio commented 2 months ago

@jkh1 I agree; that is somewhat where I was going with "lines of code". In fact, I originally intended to quantify the number of functions used, but then I realized that the majority of common tasks bundle functionality together, and an LLM may just have to determine the right one.

jkh1 commented 2 months ago

Also relevant could be the way the task is formulated, as pointed out by Tischi in #79.