openai / evals

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

Are unmerged PRs the result of irrelevance to the model? #873

Open albukirky1 opened 1 year ago

albukirky1 commented 1 year ago

Describe the feature or improvement you're requesting

Hi, this is not a suggestion, but rather a question.

I have been working on new eval ideas lately, but none of my PRs seem to get reviewed.

I was wondering - is this due to the PRs (eval ideas) not being important enough (or not a big enough contribution) for the model?

I'm not sure whether my way of coming up with evaluation ideas is right; perhaps my PRs are headed in the wrong direction, and my approach (and perhaps others' as well) should be adjusted in order to contribute better evals.

An example of a PR I recently submitted: https://github.com/openai/evals/pull/841

Additional context

No response

qrdlgit commented 1 year ago

I downloaded all the merged PRs and asked GPT-4 to summarize the common characteristics:

The merged evals cover a wide range of topics and skills. They assess various capabilities of the AI model, including language understanding, subject matter knowledge, problem-solving skills, spatial understanding, and emotional intelligence.
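
For readers who want to reproduce this, a minimal sketch of the summarization step, assuming the merged-PR titles have already been collected (for example via the GitHub API, as sketched later in this thread); the helper name and chunk size are illustrative:

```python
# Batch merged-PR titles into prompts small enough to paste into the
# GPT-4 web interface. Assumes `titles` was gathered separately.
def summarization_prompts(titles: list[str], chunk_size: int = 100) -> list[str]:
    """Return paste-ready prompts asking GPT-4 for common characteristics."""
    prompts = []
    for i in range(0, len(titles), chunk_size):
        chunk = "\n".join(f"- {t}" for t in titles[i : i + chunk_size])
        prompts.append(
            "Below are titles of merged PRs from openai/evals. "
            "Summarize their common characteristics:\n" + chunk
        )
    return prompts
```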

albukirky1 commented 1 year ago

@qrdlgit I'm not sure the sample of merged PRs is large enough to conclude anything about what they look for, but this is a really nice observation. It does make sense that most PRs revolve around language, so that maybe the model will get better at digesting large texts rather than just giving better answers.

SkyaTura commented 1 year ago

@qrdlgit just out of curiosity, did you try to identify the patterns in the ignored PRs?

Edit: Actually, it would be great to analyze every PR along with its status: open-active, open-stale, draft-active, draft-stale, closed-merged, closed-canceled, and so on.
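
A sketch of what that bucketing might look like, assuming PR objects as returned by GitHub's list-pull-requests endpoint; the 30-day staleness cutoff is an arbitrary choice for illustration:

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=30)  # arbitrary cutoff for "stale"

def classify(pr: dict) -> str:
    """Map one GitHub PR object to a bucket like 'open-active' or 'closed-merged'."""
    if pr["state"] == "closed":
        return "closed-merged" if pr["merged_at"] else "closed-canceled"
    updated = datetime.fromisoformat(pr["updated_at"].replace("Z", "+00:00"))
    activity = "stale" if datetime.now(timezone.utc) - updated > STALE_AFTER else "active"
    prefix = "draft" if pr.get("draft") else "open"
    return f"{prefix}-{activity}"
```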

qrdlgit commented 1 year ago

@SkyaTura Yes, absolutely. For those serious about creating an eval here, there is definitely value in going back through all the PRs and reading them closely.

That said, it's possible there are extrinsic factors not mentioned in the documentation. It's sometimes difficult to predict what those might be.

SkyaTura commented 1 year ago

I was wondering what we could extract by running the whole PR history through an LLM itself 🤔

That would be expensive, tho.

I'm still figuring out how this works, just found this repo a couple minutes ago.

qrdlgit commented 1 year ago

Not so much expensive as perhaps a bit technically challenging. However, we can always ask GPT-4, right?

Try this prompt:

I'd like to better understand why PRs are being merged and not merged. Is there a way I can extract all the PR data for a particular repository on github and feed it to GPT4 to summarize and analyze?

Depending on your particular skill set, you might need to get GPT-4 to further break down what it provides. Also, you will need to explain that you will be using the web interface for GPT-4. I'd recommend using the GitHub REST APIs if possible.
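
As a starting point, here is a minimal sketch of that extraction, assuming the `requests` package and the GitHub REST API (unauthenticated requests are rate-limited, so a personal access token may be needed for the full history); it writes each PR's number, status, title, and a trimmed body to a text file that can be pasted chunk by chunk into the GPT-4 web interface:

```python
import requests

def dump_prs(repo: str = "openai/evals", out_path: str = "prs.txt") -> None:
    """Dump number, status, title, and trimmed body of every PR to a text file."""
    with open(out_path, "w", encoding="utf-8") as out:
        page = 1
        while True:
            resp = requests.get(
                f"https://api.github.com/repos/{repo}/pulls",
                params={"state": "all", "per_page": 100, "page": page},
                headers={"Accept": "application/vnd.github+json"},
            )
            resp.raise_for_status()
            prs = resp.json()
            if not prs:
                break  # no more pages
            for pr in prs:
                status = "merged" if pr["merged_at"] else pr["state"]
                body = (pr["body"] or "").strip().replace("\n", " ")[:300]
                out.write(f"#{pr['number']} [{status}] {pr['title']}\n{body}\n\n")
            page += 1

if __name__ == "__main__":
    dump_prs()
```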

SkyaTura commented 1 year ago

Indeed, I already pulled the PR history to try something, but there isn't much beyond what you mentioned when going by titles alone.

Sanitizing the descriptions and prompting with them as well might give better results, but it would have to be done programmatically, and I would need a GPT-4 API key for that, tho.

(I also don't have ChatGPT Plus yet; it costs too much in my currency.)

Maybe I'll try a proof of concept with GPT-3.5 and a handpicked selection later.
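
A minimal sketch of what such a proof of concept might look like, assuming the `openai` Python package (v1+) with an OPENAI_API_KEY in the environment; the function name and the example PRs are purely illustrative:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def analyze_prs(prs: list[tuple[str, str]]) -> str:
    """Ask gpt-3.5-turbo what a handpicked set of (title, description) PRs share."""
    listing = "\n\n".join(f"Title: {t}\nDescription: {d}" for t, d in prs)
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": "These are PRs from the openai/evals repo. "
                       "What characteristics do they share?\n\n" + listing,
        }],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    # Hypothetical handpicked examples; real input would come from the PR dump.
    print(analyze_prs([
        ("Add homophone disambiguation eval", "Tests picking the right homophone in context."),
        ("Add spatial reasoning eval", "Tests following 2D navigation instructions."),
    ]))
```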

Sorry for deviating from the original question of the issue, btw.

qrdlgit commented 1 year ago

@SkyaTura I think your deviation was important, and there needs to be more discussion around this topic - but you're right. I'll take the blame for the hijack here, so I have opened a discussion on this topic and will continue it there: https://github.com/openai/evals/discussions/882

eugene-kim-pipe17 commented 1 year ago

@andrew-openai is there anything you can share here?

I've submitted a couple of eval PRs as well (https://github.com/openai/evals/pull/763 and https://github.com/openai/evals/pull/747). It would be great to know if the lack of response is simply due to a large backlog of PRs to assess (I'm sure you and your team are very busy) or if it's because of issues with the PR content/quality.

qrdlgit commented 1 year ago

One suggestion for folks at OpenAI: you might want to add an item to the PR checklist:

[ ] I understand that opening a PR, even if it meets the requirements above, does not guarantee that it will be reviewed or merged, nor that GPT-4 access will be granted.

Please note: this is not meant to be a complaint. I think we all understand that OpenAI is resource-constrained and is trying to strike the right balance in terms of how it provides access. However, I think it would be fair to let folks know about the situation up front, as they may have the expectation that they will get feedback on their PR.

I am working on a GPT prompt that could provide some reviewing/critiquing capability (https://github.com/openai/evals/discussions/882). I'm coming to terms with the fact that most of our PRs probably won't get merged into this repo, but I am concerned that there is a missed opportunity here. These PRs could be useful for other AI projects, so some review and feedback would, I think, help ensure that the evals are well formed and generally useful.

If @andrew-openai or others could take a look at the prompt and share some thoughts on how to improve it so that it offers real review capability, that would be very helpful.

andrew-openai commented 1 year ago

Hi folks, sorry for the pace of PR reviews; I actually took some time off this week, which is why there haven't been many reviews in the past few days.

I was wondering - is this due to the PRs (eval ideas) not being important enough (or not a big enough contribution) for the model?

The general pattern has been that most eval PRs have good content but need iteration on the prompts to become meaningful evals. Recognizing that it takes quite some time and effort to open an Eval PR, I'm trying to make sure that each PR gets some feedback on how to improve rather than an outright rejection. So while I have looked at many evals, I haven't had the chance to leave that feedback on each one. We're well aware that this is slowing down the pace at which PRs get reviewed.

In the next few weeks, there will be more people from our side available to review Eval PRs and leave that feedback, beyond just me. This should dramatically improve the pace at which you get feedback on your ideas and at which PRs get closed out.

Thanks for your patience. We love the enthusiasm, and the contributions so far have been great. Until we get more help, I'll also resume reviewing Evals over the next few days.

eugene-kim-pipe17 commented 1 year ago

I appreciate the response and the transparency @andrew-openai !