openai / evals

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
Other
14.95k stars 2.6k forks source link

Idea for Evals: Sorting numbers with repeats and negatives #782

Open voynow opened 1 year ago

voynow commented 1 year ago

Note: I can develop this feature - creating the issue to get some feedback before developement

Is this diverse from the existing evals or is this too basic? I skimmed through the existing evals and I don't see anything similar except for complex number pattern (https://github.com/openai/evals/pull/223). I don't currently have GPT4 api access, although I do have chatGPT plus. Using the GPT4 engine I have tested this idea with the following examples:

Example 1

input: Sort the following numbers least to greatest (only include the numbers in your response): 3, 5, 2, 3, 10, 3, 5, 7, 7, 9, 10, 8, 7, 4, 5, 5, 6, 5, 1, 8, 1, 7, 4, 10, 4, 1, 5, 7, 3, 2

ideal: 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 7, 7, 7, 7, 7, 8, 8, 9, 10, 10, 10

response: 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 7, 7, 7, 7, 8, 8, 9, 10, 10, 10

In this example, GPT4 miscounted the 5s and the 7s

Example 1 retry in a new window

input: Sort the following numbers least to greatest (only include the numbers in your response): 3, 5, 2, 3, 10, 3, 5, 7, 7, 9, 10, 8, 7, 4, 5, 5, 6, 5, 1, 8, 1, 7, 4, 10, 4, 1, 5, 7, 3, 2

ideal: 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, 7, 7, 7, 7, 7, 8, 8, 9, 10, 10, 10

response: 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 7, 7, 7, 7, 7, 8, 8, 9, 10, 10, 10

In this example, GPT4 miscounted the 5s again

Example 2

input: Sort the following numbers least to greatest (only include the numbers in your response): 2, -4, -8, 9, -1, 10, 8, -7, 7, -1, -4, -5, -1, 0, 1, 8, 2, 0, -8, -10, 8, -5, -10, 7, -1, -3, -1, 8, 7, -5, -2, 1, -4, 7, 9, 6, -8, 10, -5, 5, -6, 4, -5, -2, -8, -1, -10, 1, -8, -4

ideal: -10, -10, -10, -8, -8, -8, -8, -8, -7, -6, -5, -5, -5, -5, -5, -4, -4, -4, -4, -3, -2, -2, -1, -1, -1, -1, -1, -1, 0, 0, 1, 1, 1, 2, 2, 4, 5, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 10, 10

response: -10, -10, -10, -8, -8, -8, -8, -8, -7, -6, -5, -5, -5, -5, -4, -4, -4, -4, -3, -2, -2, -1, -1, -1, -1, -1, 0, 0, 1, 1, 1, 2, 2, 4, 5, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 10, 10

Let me know what you all think. This would be my first contribution to open source - very exciting!

qrdlgit commented 1 year ago

@voynow Why did you close this? It looks good to me, but you should get Andrew's opinion.

voynow commented 1 year ago

@qrdlgit I found this (https://github.com/openai/evals/pull/93) PR that looks like it does what I was planning on doing. This has been opened for a while with no approval, maybe something is wrong with this one?

Am I correct in my understanding here? Didn't want to duplicate work/logic.

qrdlgit commented 1 year ago

Yeah, I saw that as well. TBH though, this seems like a great eval to me, but I'm just a user.

Sorting things like this is a very common use case that anyone might use GPT4 for.

For example, let's say you are a teacher and have a set of names or ids of students and you want to sort them in some way as a way of creating a 'fair order'. This is very common, as we all know.

I also very frequently use it for one off decoding/encoding tasks. It's a bit unnerving to see it fail so silently like this.

It'd be great to get an @andrew-openai perspective. It could be just a hard thing for them to fix at this point, which might be why they don't want an eval (yet), but I think it would help to hear that.

voynow commented 1 year ago

Great perspective thanks for adding that. I want to point out that I also just created https://github.com/openai/evals/issues/785 - so I can work on either one of these once we get some more perspectives here.

Ein-Tim commented 1 year ago

@voynow If you want more feedback on this issue, I suggest reopening it.

andrew-openai commented 1 year ago

Hey, thanks for the discussion!

I agree this is a good eval idea, and I agree with qrdlgit that it seems to be quite representative of common tasks. Also, thanks for bringing my attention to https://github.com/openai/evals/pull/93, it looks like a good eval and I'll probably merge it after testing it myself.

I like the examples you've given: sorting lists of students or encoding/decoding tasks. If you are interested in contributing evals of this flavor, having these domain specific variants are quite useful and I wouldn't be surprised if model performances vary across the "domain" that this basic capability is applied to. We've reduced the minimum count to 15 samples per eval, so this should be pretty quick to write by hand or collect variants of what you may already be using with the API or ChatGPT.

voynow commented 1 year ago

@andrew-openai Thanks for your feedback above. FYI I created two PRs based on your suggestions. See below:

Sorting rectangles by area: https://github.com/openai/evals/pull/878 Counting numbers greater than X: https://github.com/openai/evals/pull/856