JoOkuma opened this issue 3 months ago
Hey @JoOkuma,
thanks for bringing this up! We had a very short discussion about this in #53. I personally hope that model providers use our dataset to improve their models. In the preprint, for example, we wrote: "We intentionally included test-cases and prompts which we presume are currently not solvable by LLMs, and we encourage the community to add more. With this, the benchmark could guide LLM developers in this field towards more advanced code-generation."
I could also imagine publishing our dataset on huggingface to make it even easier. A challenge for me currently is how to get the evaluation running within huggingface (or other online leaderboards). Do you by chance have experience with this or know anyone?
Cheers, Robert
hey @haesleinhuepf, I didn't notice #53 already existed.
Do you by chance have experience with this or know anyone?
Let me check, I think I might know someone.
Aww, this would be super amazing, an online BioImageAnalysis code-gen LLM leaderboard where everyone can test their models 🤩
Hey @bellabf, this is the issue I mentioned to you. Thanks for your attention.
I could also imagine publishing our dataset on huggingface to make it even easier. A challenge for me currently is how to get the evaluation running within huggingface (or other online leaderboards). Do you by chance have experience with this or know anyone?
Hi, everyone :) Thanks @JoOkuma for thinking of me for this project. I am happy to help.
I must confess that I am a bit lost about what is expected. Are we talking about having a leaderboard or publishing the dataset? Or a combination of the two?
The leaderboard would take more work to get done, and it would depend on what kind of metrics we are testing, whether these evaluations require human input, and where it runs. From the conversation, I am guessing you will want to use one of HF's Spaces.
Can you share a bit more of what you are thinking about? Is there any work that has already been done?
Hi @bellabf ,
thanks for chiming in! =-)
TL;DR: I would like to make the benchmark here in this repository more accessible, so that everyone can run their models through it if they wish to. I presume huggingface Spaces are the way to go, but I haven't explored alternatives.
I must confess that I am a bit lost about what is expected. Are we talking about having a leaderboard or publishing the dataset? Or a combination of the two?
Certainly the leaderboard, but potentially both. I think it would be fantastic if we could have an online leaderboard where our benchmarking results are shown, and where anybody could upload other results or use the huggingface infrastructure to run other models through our benchmark. If it's necessary to upload our dataset to huggingface, that would be ok for me (I have done similar things before). And if more vendors then use the data to make better models, that's ok too. In the end, I hope we will have better models. That should be the overarching goal.
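Just to make the upload part concrete, publishing the test cases could look roughly like this. This is only a sketch of the `datasets` API; the column names, the example entry, and the repo id `haesleinhuepf/human-eval-bia` are made up and would need to match how the cases are actually stored in this repository.

```python
# Minimal sketch for publishing the benchmark cases as a Hugging Face dataset.
# Column names and the repo id are hypothetical placeholders.
from datasets import Dataset

cases = {
    "task_id": ["segment_nuclei_2d"],  # hypothetical example entry
    "prompt": ["Write a function that segments nuclei in a 2D image ..."],
    "test": ["def check(candidate):\n    assert candidate(...) is not None"],
}

dataset = Dataset.from_dict(cases)
# Requires being logged in via `huggingface-cli login` (or an HF token).
dataset.push_to_hub("haesleinhuepf/human-eval-bia")
```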
The leaderboard would take more work to get done, and it would depend on what kind of metrics we are testing
The repository here represents a single metric. You could name it "HumanEval-BIA" as it is technically very (very!) close to HumanEval.
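Since the metric is essentially HumanEval's pass@k, a leaderboard could report the same number. Here is a rough sketch of the standard unbiased estimator from the HumanEval paper (the numbers in the usage line are made up):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (HumanEval):
    n = samples generated per task, c = correct samples, k = budget."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 10 samples per task, 3 passed the unit test:
print(pass_at_k(n=10, c=3, k=1))
```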
whether these evaluations require human input, and where it runs
Human input is not necessary. The benchmark consists of unit tests written in Python. The actual evaluation is demonstrated in this notebook. The notebook executes code sampled from LLMs, but running it in a container should be safe.
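One way to keep that execution safe inside a Space could be to run each candidate together with its unit test in a subprocess with a timeout (on top of the container isolation). This is a simplified stand-in to illustrate the idea, not the notebook's actual code:

```python
import subprocess
import sys
import tempfile

def run_candidate(candidate_code: str, test_code: str, timeout_s: int = 30) -> bool:
    """Write the LLM-generated function and its unit test to a temp file and
    execute them in a separate Python process. Returns True if the test passes.
    Hypothetical helper, not part of this repository."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code + "\n")
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout_s
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```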
From the conversation, I am guessing you will want to use some of the HF's spaces.
Yes! I dived a tiny bit into leaderboards on huggingface (and was overwhelmed by the files and the Docker config). I'm new to this and don't understand many of the terms used in this context.
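From what I understand so far, the simplest form of a leaderboard Space would just be a small Gradio app that renders a results table; accepting submissions and running evaluations would come on top of that. A minimal sketch, assuming the benchmark results are kept in a hypothetical `results.csv`:

```python
import gradio as gr
import pandas as pd

# Hypothetical results file with one row per model,
# e.g. columns: model, pass@1, pass@10
results = pd.read_csv("results.csv")

with gr.Blocks() as demo:
    gr.Markdown("# HumanEval-BIA leaderboard")
    gr.Dataframe(results)

demo.launch()
```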
I'm looking forward to hearing what you think. Big thanks for your time!
Hi, @haesleinhuepf recommended I post this here.
Because you are making this evaluation benchmark public, couldn't LLMs use it as training data and therefore overfit to it? It would then become harder to evaluate whether they really "understand" bioimage analysis workflows or are just memorizing what they saw in the training set.
For example, private test sets are becoming a common way to avoid this issue, as is the approach of the Cell Tracking Challenge, which doesn't make its test labels public.