openai / mle-bench

MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering
https://openai.com/index/mle-bench/

Calculating scores #19

Closed · SamuelSchmidgall closed 2 weeks ago

SamuelSchmidgall commented 3 weeks ago

Hello,

I am trying to calculate scores for the various techniques (the code below is for OpenHands). Am I calculating the scores and interpreting the run_group_experiments CSV correctly, or does the paper report results using a different method? Thank you.

import csv
import json
import os

RUNS_DIR = "mlebench_eval/runs"
experiment = "scaffolding-gpt4o-opendevin"

# run_group_experiments.csv maps each experiment name to its run-group directory.
with open(os.path.join(RUNS_DIR, "run_group_experiments.csv"), "r") as f:
    rows = [r for r in csv.reader(f) if r][1:]  # skip the header row

competitions = dict()

for model, run_group in rows:
    if model != experiment:
        continue
    # Each run-group directory contains a single grading report JSON.
    report_name = os.listdir(os.path.join(RUNS_DIR, run_group))[0]
    with open(os.path.join(RUNS_DIR, run_group, report_name), "r") as f:
        reports = json.load(f)["competition_reports"]
    for comp in reports:
        comp_id = comp["competition_id"]
        if comp_id not in competitions:
            competitions[comp_id] = {
                "scores": [],
                "gold": [],
                "silver": [],
                "bronze": [],
                "gold_threshold": comp["gold_threshold"],
                "silver_threshold": comp["silver_threshold"],
                "bronze_threshold": comp["bronze_threshold"],
            }
        competitions[comp_id]["scores"].append(comp["score"])
        competitions[comp_id]["gold"].append(comp["gold_medal"])
        competitions[comp_id]["silver"].append(comp["silver_medal"])
        competitions[comp_id]["bronze"].append(comp["bronze_medal"])

# Per-competition summary: mean score over runs that produced a score.
for comp_id, data in competitions.items():
    print(comp_id, "\n", data, "\n")
    not_nones = [s for s in data["scores"] if s is not None]
    if len(not_nones) > 0:
        print("Mean Score:", sum(not_nones) / len(not_nones), "\n")

thesofakillers commented 2 weeks ago

Hey, this is roughly on the right track. However, we never directly compute a "mean score", so I can't comment on that. We simply average across the boolean columns instead of the scores, i.e.:

'submission_exists',
'valid_submission',
'above_median',
'bronze_medal',
'silver_medal',
'gold_medal',
'any_medal'

These should never be None/null. They're always true or false for an individual entry in a grading report file.
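
For concreteness, here is a minimal sketch of that aggregation over a single grading report (illustrative only; the path is a placeholder, not a file in this repo):

import json

# The boolean columns to average over; these are never None/null in a grading report.
BOOL_COLS = [
    "submission_exists",
    "valid_submission",
    "above_median",
    "bronze_medal",
    "silver_medal",
    "gold_medal",
    "any_medal",
]

# Placeholder path -- point this at one of your grading_report.json files.
with open("path/to/grading_report.json", "r") as f:
    reports = json.load(f)["competition_reports"]

for col in BOOL_COLS:
    values = [entry[col] for entry in reports]
    # The mean of booleans is the fraction of entries where the column is True.
    print(col, sum(values) / len(values))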

SamuelSchmidgall commented 2 weeks ago

Hmm, when I run this code I observe the following example output for some challenges:

Example of mixed None scores

rsna-miccai-brain-tumor-radiogenomic-classification 
 {'scores': [None, 0.51647, None, None], 'gold': [False, False, False, False], 'silver': [False, False, False, False], 'bronze': [False, False, False, False], 'gold_threshold': 0.60096, 'silver_threshold': 0.5815, 'bronze_threshold': 0.57449} 

Example of all None scores

tensorflow-speech-recognition-challenge 
 {'scores': [None, None, None, None, None], 'gold': [False, False, False, False, False], 'silver': [False, False, False, False, False], 'bronze': [False, False, False, False, False], 'gold_threshold': 0.90485, 'silver_threshold': 0.89627, 'bronze_threshold': 0.88793} 

Example of all valid scores

siim-isic-melanoma-classification 
 {'scores': [0.63462, 0.84561, 0.75364, 0.8074], 'gold': [False, False, False, False], 'silver': [False, False, False, False], 'bronze': [False, False, False, False], 'gold_threshold': 0.9455, 'silver_threshold': 0.9401, 'bronze_threshold': 0.937} 

Here is an example of a null score in the 2024-10-04T22-43-53-GMT_run-group_opendevin/2024-10-08T15-29-24-GMT_grading_report.json file.

    {
      "competition_id": "tensorflow-speech-recognition-challenge",
      "score": null,
      "gold_threshold": 0.90485,
      "silver_threshold": 0.89627,
      "bronze_threshold": 0.88793,
      "median_threshold": 0.77722,
      "any_medal": false,
      "gold_medal": false,
      "silver_medal": false,
      "bronze_medal": false,
      "above_median": false,
      "submission_exists": false,
      "valid_submission": false,
      "is_lower_better": false,
      "created_at": "2024-10-08T15:29:24.398593",
      "submission_path": "None"
    },

Am I perhaps misinterpreting how scores and medals are calculated here?

thesofakillers commented 2 weeks ago

If a score is null, it probably means the agent failed to make a submission or submitted an invalid one, in which case it didn't achieve a medal.

Like I said, the boolean columns will never be None/null. We take the mean over the boolean columns, not the score; you'll see we never report a "mean score" in the paper. The score, as you've pointed out, can of course be None/null. The values of the boolean columns are determined by the score and the thresholds.
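
For intuition, here's a rough sketch of how a score plus the thresholds could map to the boolean columns. This is illustrative only (whether comparisons are strict and whether the medal booleans are exclusive are assumptions here); the actual logic is in mlebench/grade_helpers.py.

# Rough sketch only -- see mlebench/grade_helpers.py for the real grading logic.
def medal_booleans(score, gold_t, silver_t, bronze_t, median_t, is_lower_better=False):
    if score is None:
        # No gradable submission: every boolean column is False.
        return {"gold_medal": False, "silver_medal": False, "bronze_medal": False,
                "any_medal": False, "above_median": False}

    def beats(threshold):
        # "Better" means lower when is_lower_better, higher otherwise.
        return score <= threshold if is_lower_better else score >= threshold

    gold = beats(gold_t)
    silver = beats(silver_t) and not gold               # exclusivity is an assumption
    bronze = beats(bronze_t) and not (gold or silver)
    return {"gold_medal": gold, "silver_medal": silver, "bronze_medal": bronze,
            "any_medal": gold or silver or bronze, "above_median": beats(median_t)}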

Let me know if you're still uncertain about something.

thesofakillers commented 2 weeks ago

If you re-read my original response carefully:

"These should never be None/null."

"These" refers to the boolean columns I outlined.

SamuelSchmidgall commented 2 weeks ago

Yes, I realize you're not calculating the average score in the original paper; rather, you track medal-earned percentages. But I am trying to calculate the average score here, and I was checking that my method/code is valid given how your CSV is meant to be interpreted. It seems you're saying: "when the score is None, a submission was not made during that run." If that is accurate, that's what I needed; if not, let me know how I may be misinterpreting the reported results.

james-aung commented 2 weeks ago

Hey @SamuelSchmidgall, yup, if the score is None then there wasn't a submission CSV produced in that run. Sometimes that was due to a hardware failure on our infra during the run, but sometimes the agent simply failed to produce one.

james-aung commented 2 weeks ago

Let me know if any of the scores you're trying to calculate aren't matching up with the paper, and we can take a closer look at those.

SamuelSchmidgall commented 2 weeks ago

Perfect, thank you @james-aung!

thesofakillers commented 2 weeks ago

Note: the score can also be None in cases where a submission was made but the submission file was invalid, e.g. it had too many rows or incorrect ids.
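
If you need to tell those two cases apart when aggregating, the submission_exists and valid_submission flags in each report entry are enough. A small sketch:

# Sketch: classify why a report entry has a null score, using its boolean flags.
def failure_mode(entry):
    if entry["score"] is not None:
        return "graded"
    if not entry["submission_exists"]:
        return "no submission produced"
    if not entry["valid_submission"]:
        return "submission produced but invalid (e.g. wrong rows or ids)"
    return "unknown"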

I recommend checking the grading logic:

Here https://github.com/openai/mle-bench/blob/main/mlebench/grade.py#L52-L95

And here https://github.com/openai/mle-bench/blob/main/mlebench/grade_helpers.py#L36-L55