Open · albukirky1 opened 1 year ago
I downloaded all the merged PRs and asked GPT4 to summarize the common characteristics:
The merged evals cover a wide range of topics and skills, including:
These evals assess various capabilities of the AI model, including language understanding, subject matter knowledge, problem-solving skills, spatial understanding, and emotional intelligence.
@qrdlgit I'm not sure the sample of merged PRs is large enough to conclude anything about what they look for, but this is a really nice observation. It does make sense that most of the PRs revolve around language, so that maybe the model gains a better understanding of how to digest large texts rather than just giving better answers.
@qrdlgit just out of curiosity, did you try to identify patterns in the ignored PRs?
Edit: Actually, it would be great to analyze every PR along with its status, e.g. open-active, open-stale, draft-active, draft-stale, closed-merged, closed-canceled, and so on.
@SkyaTura Yes, absolutely. For those serious about creating an eval here, there is definitely value in going back through all the PRs and reading them closely.
That said, it's possible there are extrinsic factors not mentioned in the documentation. It's sometimes difficult to predict what those might be.
I was wondering what we could extract by iterating over the whole PR history with an LLM itself 🤔
That would be expensive, tho.
I'm still figuring out how this works, just found this repo a couple minutes ago.
Not that expensive, though perhaps a bit technically challenging. However, we can always ask GPT4, right?
Try this prompt:
> I'd like to better understand why PRs are being merged and not merged. Is there a way I can extract all the PR data for a particular repository on github and feed it to GPT4 to summarize and analyze?
Depending on your particular skill set, you might need to get GPT4 to further break down what it provides. Also, you will need to explain that you will be using the web interface for GPT4. I'd recommend using the GitHub REST APIs if possible.
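If it helps, here is a minimal sketch of what that extraction could look like, assuming Python with the `requests` library and the public GitHub REST API (unauthenticated requests are rate-limited, so you may want to add a token; treat this as illustrative):

```python
import requests

def fetch_pulls(owner: str, repo: str, state: str = "all", max_pages: int = 10):
    """Fetch pull requests for a repo via the GitHub REST API, page by page."""
    pulls = []
    for page in range(1, max_pages + 1):
        resp = requests.get(
            f"https://api.github.com/repos/{owner}/{repo}/pulls",
            params={"state": state, "per_page": 100, "page": page},
            headers={"Accept": "application/vnd.github+json"},
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        pulls.extend(batch)
    return pulls

# Example: print each PR's number, state, merge date, and title for openai/evals.
for pr in fetch_pulls("openai", "evals"):
    print(pr["number"], pr["state"], pr.get("merged_at"), pr["title"])
```

From there you could paste batches of titles/descriptions into the GPT4 web interface and ask it to summarize.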
Indeed, I already fetched the PR history to try something, but there isn't much beyond what you mentioned before when going only by the titles.
Sanitizing the descriptions and prompting with them as well would probably give better results, but that would have to be done programmatically, and I would need a GPT-4 API key for that, tho.
(I also don't have GPT Plus yet; it's too expensive in my currency.)
Maybe I'll try a proof of concept with GPT-3.5 and a handpicked selection later.
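Something like the following is roughly what I have in mind for that proof of concept, assuming the `openai` Python package and a handpicked list of sanitized PR descriptions (the descriptions and prompt wording below are just placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Handpicked, sanitized PR descriptions (placeholder data).
pr_descriptions = [
    "Adds an eval for multi-step arithmetic word problems.",
    "Adds an eval testing legal reasoning on contract clauses.",
]

prompt = (
    "Here are descriptions of PRs submitted to the openai/evals repo. "
    "Summarize the common characteristics and any patterns you notice:\n\n"
    + "\n".join(f"- {d}" for d in pr_descriptions)
)

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```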
Sorry for deviating from the original question of the issue, btw.
@SkyaTura I think your deviation was important and there needs to be more discussion around this topic - but you're right. I'll take the blame for the hijack here, so I have opened a discussion on this topic and will continue it there: https://github.com/openai/evals/discussions/882
@andrew-openai is there anything you can share here?
I've submitted a couple eval PRs as well (https://github.com/openai/evals/pull/763 and https://github.com/openai/evals/pull/747). It would be great to know if the lack of response is simply due to a large backlog of PRs to assess (I'm sure you and your team are very busy) or if it's because of issues with the PR content / quality.
One suggestion for folks at OpenAI: you might want to add an item like this to the PR checklist:
- [ ] I understand that opening a PR, even if it meets the requirements above, does not guarantee that the PR will be reviewed or merged, nor that GPT-4 access will be granted.
Please note: This is not meant to be a complaint. I think we all understand that OpenAI is resource constrained and is trying to strike the right balance in terms of how it's providing access. However, I think it would be fair to let folks know up front about the situation, as they may expect to get feedback on their PR.
I am working on a GPT prompt that could provide some reviewing / critiquing capability: https://github.com/openai/evals/discussions/882. I'm coming to terms with the fact that most of our PRs probably won't get merged into this repo, but I am concerned that there is a missed opportunity here. These PRs could be useful for other AI projects, so I think some review / feedback would help ensure that the evals are well formed and generally useful.
If @andrew-openai or others could take a look at the prompt and provide some thoughts on how to improve it so we'd get some review capability, that would be very helpful.
Hi folks, sorry for the pace of PR reviews, I actually took some time off this week which is why there haven't been many reviews in the past few days.
> I was wondering - is this due to the PRs (eval ideas) not being important enough (or not a big enough contribution) for the model?
The general pattern has been: most eval PRs have good content but need iterating on the prompts to be meaningful evals. Because of this, and recognizing that it takes quite some time and effort to open an Eval PR, I'm trying to make sure that each PR gets some feedback on how to improve it rather than an outright rejection. So while I have looked at many evals, I haven't had the chance to leave that feedback on each one. We're well aware that this is slowing down the pace at which most PRs get reviewed.
In the next few weeks, there will be more people from our side available to review Eval PRs and leave that feedback, beyond just me. This should dramatically improve the pace at which you get feedback on your ideas and at which PRs are closed.
Thanks for the patience. We love the enthusiasm and the contributions so far have been great. Until we get more help, I'll also resume reviewing Evals over the next few days.
I appreciate the response and the transparency @andrew-openai !
Describe the feature or improvement you're requesting
Hi, this is not a suggestion, but rather a question.
I have been working on new ideas for evals lately, but none seem to have been reviewed.
I was wondering - is this due to the PRs (eval ideas) not being important enough (or not a big enough contribution) for the model?
I'm currently unsure whether my way of thinking about eval ideas is right; perhaps my PRs are not headed in the right direction, and my way of thinking (and perhaps others' as well) should be adjusted in order to contribute better evals.
Example of a PR I recently sent: https://github.com/openai/evals/pull/841