haesleinhuepf / human-eval-bia

Benchmarking Large Language Models for Bio-Image Analysis Code Generation
MIT License

Test cases from the Icy / Fiji or Java ecosystems? #74

Closed · tinevez closed this issue 2 weeks ago

tinevez commented 3 weeks ago

Hello all. Great work!

All the test cases seem to use Python. I know Java is living in its twilight for bioimage analysis, but the tools developed in this language still have a very strong impact on biologists. This might not be true forever, but it is still the case today and will probably remain so for some time.

Would it make sense to include test cases that would help people developing Java plugins for bioimage analysis? For instance testing

Even if LLMs are unable to score high in the rankings for these frameworks, that would still be an interesting conclusion. We might also want to consider the Julia language as a rising star.

haesleinhuepf commented 3 weeks ago

Hi @tinevez ,

yeah, that's a great point! Unfortunately, we are reusing the HumanEval framework here, which is Python-specific.

General LLM coding capabilities in other programming languages have been benchmarked before, though, so technically it should be possible. ImageJ Macro just hasn't been covered yet, as far as I know.
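For illustration, here is a minimal sketch of what a HumanEval-style, Python-specific test case looks like; the task and function names are hypothetical and not taken from this repository. The prompt is the signature plus docstring, the LLM generates the body, and a hand-written check function decides pass/fail without any human in the loop:

```python
import numpy as np

def label_bright_objects(image: np.ndarray, threshold: float) -> np.ndarray:
    """
    Threshold the image at the given value and return a label image in which
    every connected bright object carries a unique integer label;
    background pixels are 0.
    """
    # the LLM is asked to complete this function body


def check(candidate):
    # two separated bright squares: a correct solution yields exactly two labels
    image = np.zeros((10, 10))
    image[1:4, 1:4] = 1.0
    image[6:9, 6:9] = 1.0
    labels = candidate(image, threshold=0.5)
    assert labels.max() == 2
    assert labels[0, 0] == 0  # background stays zero
```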

Even if LLMs are unable to score high in the rankings for these frameworks, that would still be an interesting conclusion.

Fully agreed! Measuring is knowing. One more interesting aspect: the Python ecosystem is characterised by excellent online documentation. If the Icy/Fiji/Java ecosystem were characterised by even better online documentation, I could imagine LLMs performing better in that ecosystem, as the LLMs are certainly trained on the online documentation.

My opinion is that including other programming languages in this project is out of scope, simply because of the enormous projected workload. But let's see what others say.

Also, if you plan to build such a benchmark on the Java side, which might be much easier and would still allow us to compare results if we use the same test cases, I'm happy to contribute all the knowledge I have in this context.

Thanks for the feedback!

tischi commented 3 weeks ago

I agree that testing the Java image analysis ecosystem would be very nice to have!

But I also agree with Robert: at least personally, I don't have the bandwidth to work on this.

scouser-27 commented 3 weeks ago

Forgive my lack of familiarity with the evaluation framework and why it does not apply to other languages, but how are test results decided upon, and what makes the task of testing other languages such a challenge? (I'm not diminishing the work that went into this, I simply don't know what's involved!)

For instance, I could imagine a task to calculate the Modulation Transfer Function of a certain digital X-ray image based on bar patterns. If I had an account for each of the LLMs, I could go to them and ask them to produce Java code based on the ImageJ libraries to perform this task. (I chose ImageJ because it is a very well-documented and mature system.)

I can copy and paste the question into each LLM to eliminate variations due to how the question is asked. I can then get the results, inspect the code and rate the answer that the LLM gave. I can then do this for Python, Groovy and JavaScript. Perhaps I am oversimplifying the task, or the magnitude of effort involved?

haesleinhuepf commented 3 weeks ago

the task of testing other languages such a challenge?

The benchmark, the test cases and the framework are written in Python. Hence, testing Python functions is straightforward. If we included Java, Fiji or Icy now, we would need to think about installation and which software versions to use, and rewrite large parts of the framework in Java. In fact, we would be starting from scratch.
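As a rough sketch of why this is straightforward in Python (with illustrative helper names, not the actual framework code): a generated completion can simply be executed in the same Python process and the existing unit test run against it, whereas a Java/Fiji/Icy equivalent would need its own JVM setup, plugin installation and test runner.

```python
def run_test_case(prompt: str, completion: str, check_source: str,
                  entry_point: str) -> bool:
    """Execute an LLM-generated completion and run its unit test.

    Illustrative sketch only: a real harness additionally handles
    sandboxing, timeouts and package dependencies.
    """
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)   # defines the candidate function
        exec(check_source, namespace)          # defines check(candidate)
        namespace["check"](namespace[entry_point])  # asserts raise on failure
        return True
    except Exception:
        return False
```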

I can copy and paste the question into each LLM to eliminate variations due to how the question is asked. I can then get the results, inspect the code and rate the answer that the LLM gave. I can then do this for Python, Groovy and JavaScript. Perhaps I am oversimplifying the task, or the magnitude of effort involved?

You would need to do this 10260 times. That's why we used the HumanEval framework: it automates that.

I can then [...] rate the answer that the LLM gave.

The current benchmark is free of human interpretation. It would be great to keep it like this.
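To make "free of human interpretation" concrete: every generated sample simply passes or fails its unit test, and per-task results can then be aggregated with the standard pass@k estimator from the original HumanEval paper (Chen et al., 2021). This is a sketch, not the repository's exact code:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: number of samples generated for a task
    c: number of samples that passed the unit test
    k: budget of samples considered
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per task, 3 passed -> probability that at least
# one of k=1 randomly chosen samples is correct.
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```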

scouser-27 commented 3 weeks ago

Thanks for the explanation. I guess I figured on a more simplistic analysis, like:

1) Is the code sufficiently complete to achieve the task?
2) Does the code produce the right answer under the various sample conditions? (It would probably be helpful if there were multiple values to be determined; see the sketch below.)
3) Does the code omit libraries / functions used by other code? (This could get messy.)
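Criterion 2 in particular maps well onto an automated setup: a check can assert several expected values, so incomplete or only partially correct code does not count as a pass. A hypothetical example, not taken from the benchmark:

```python
def check(candidate):
    # hypothetical task: compute summary statistics of a list of numbers;
    # asserting several values means the code must be both complete and correct
    result = candidate([2, 4, 4, 4, 5, 5, 7, 9])
    assert result["mean"] == 5.0
    assert result["std"] == 2.0    # population standard deviation
    assert result["count"] == 8
```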

Thanks for posting the results of this very interesting project.