Closed · SamuelSchmidgall closed this 2 weeks ago

Hello,

I am trying to calculate scores for the various techniques (the example below is for OpenHands) and have written the code shown below. I was wondering if I am calculating the scores correctly and understanding the `group_experiments` CSV correctly, or if the paper reports a different method? Thank you.
Hey, this is roughly on the right track. However, we never directly compute a "mean score", so I can't comment on that. We simply average across the boolean columns instead of the scores, i.e.:

```
'submission_exists',
'valid_submission',
'above_median',
'bronze_medal',
'silver_medal',
'gold_medal',
'any_medal'
```

These should never be `None`/`null`. They're always `true` or `false` for an individual entry in a grading report file.
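For illustration, a minimal sketch of that averaging (this is not the repo's reporting code; the file name is a placeholder, and it assumes the grading report JSON is a flat list of per-competition entries like the example further down):

```python
import json
import pandas as pd

BOOL_COLS = [
    "submission_exists",
    "valid_submission",
    "above_median",
    "bronze_medal",
    "silver_medal",
    "gold_medal",
    "any_medal",
]

# Placeholder path; assumes the report is a list of per-competition dicts.
with open("grading_report.json") as f:
    report = json.load(f)

df = pd.DataFrame(report)
# The mean of a boolean column is the fraction of entries where it is true.
print(df[BOOL_COLS].mean())
```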
Hmm, when I run this code, I observe the following example output for some challenges:

```
rsna-miccai-brain-tumor-radiogenomic-classification
{'scores': [None, 0.51647, None, None], 'gold': [False, False, False, False], 'silver': [False, False, False, False], 'bronze': [False, False, False, False], 'gold_threshold': 0.60096, 'silver_threshold': 0.5815, 'bronze_threshold': 0.57449}

tensorflow-speech-recognition-challenge
{'scores': [None, None, None, None, None], 'gold': [False, False, False, False, False], 'silver': [False, False, False, False, False], 'bronze': [False, False, False, False, False], 'gold_threshold': 0.90485, 'silver_threshold': 0.89627, 'bronze_threshold': 0.88793}

siim-isic-melanoma-classification
{'scores': [0.63462, 0.84561, 0.75364, 0.8074], 'gold': [False, False, False, False], 'silver': [False, False, False, False], 'bronze': [False, False, False, False], 'gold_threshold': 0.9455, 'silver_threshold': 0.9401, 'bronze_threshold': 0.937}
```
Here is an example of a null score in the `2024-10-04T22-43-53-GMT_run-group_opendevin/2024-10-08T15-29-24-GMT_grading_report.json` file:

```json
{
    "competition_id": "tensorflow-speech-recognition-challenge",
    "score": null,
    "gold_threshold": 0.90485,
    "silver_threshold": 0.89627,
    "bronze_threshold": 0.88793,
    "median_threshold": 0.77722,
    "any_medal": false,
    "gold_medal": false,
    "silver_medal": false,
    "bronze_medal": false,
    "above_median": false,
    "submission_exists": false,
    "valid_submission": false,
    "is_lower_better": false,
    "created_at": "2024-10-08T15:29:24.398593",
    "submission_path": "None"
},
```
Am I perhaps misinterpreting how scores and medals are calculated here?
If a score is null, it probably means the agent failed to make a submission, or submitted an invalid submission, in which case it didn't achieve a medal.
Like I said, the boolean columns will never be `None`/`null`. We take the mean over the boolean columns, not the score. You'll see we never report a "mean score" in the paper. The score obviously can be `None`/`null`, as you've pointed out. The values of the boolean columns are determined by the score and the thresholds.
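For example, a simplified sketch of that relationship, assuming each medal boolean is just a threshold comparison on the score (this is not the repo's actual grading code; see mlebench/grade.py, linked later in the thread, for the real logic, including how ties are handled):

```python
# Simplified sketch: how the boolean columns could be derived from the score
# and the thresholds. Not mle-bench's actual grading code.
def beats(score, threshold, is_lower_better):
    # For some competitions a lower score is better (e.g. error metrics);
    # whether ties count may differ in the real implementation.
    return score <= threshold if is_lower_better else score >= threshold

def medal_booleans(score, gold_t, silver_t, bronze_t, median_t, is_lower_better):
    if score is None:  # no (valid) submission: every boolean stays False
        return {"gold_medal": False, "silver_medal": False, "bronze_medal": False,
                "any_medal": False, "above_median": False}
    gold = beats(score, gold_t, is_lower_better)
    silver = beats(score, silver_t, is_lower_better)
    bronze = beats(score, bronze_t, is_lower_better)
    return {"gold_medal": gold, "silver_medal": silver, "bronze_medal": bronze,
            "any_medal": gold or silver or bronze,
            "above_median": beats(score, median_t, is_lower_better)}
```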
Let me know if you're still uncertain about something.
If you re-read my original response carefully:

> These should never be `None`/`null`.

"These" is referring to the boolean columns I outlined.
Yes, I realize you're not calculating an average score in the original paper; rather, you track medal-earned percentages. But I am trying to calculate an average score here, and was checking to make sure my method/code was valid according to how your CSV was meant to be interpreted. It seems as though you're saying: "when the score is None, then a submission was not made during that run period". If this is an accurate statement, then that is what I needed. If not, let me know how I may be misinterpreting the reported results.
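For concreteness, a hypothetical sketch of the kind of per-competition score averaging described above (it assumes the grading report JSON is a flat list of entries and simply skips None scores; it is not code from the mle-bench repo):

```python
import json
from collections import defaultdict

def mean_scores_by_competition(report_paths):
    # Average raw scores per competition across runs, skipping None scores
    # (i.e. runs with no valid submission).
    scores = defaultdict(list)
    for path in report_paths:
        with open(path) as f:
            # Assumes each grading report is a list of per-competition entries.
            for entry in json.load(f):
                if entry["score"] is not None:
                    scores[entry["competition_id"]].append(entry["score"])
    return {cid: sum(vals) / len(vals) for cid, vals in scores.items() if vals}
```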
Hey @SamuelSchmidgall, yup, if score is `None` then there wasn't a submission CSV produced in that run. Sometimes that was because of a hardware failure on our infra during a run, but sometimes it was just the agent's failure to produce one.
Let me know if any of the scores you're trying to calculate aren't matching up with the paper, and we can take a closer look at those.
Perfect, thank you @james-aung!
Note, the score can also be `None` in cases where a submission was made but the submission file was invalid, e.g. it had too many rows or incorrect IDs.

I recommend checking the grading logic:

Here: https://github.com/openai/mle-bench/blob/main/mlebench/grade.py#L52-L95
And here: https://github.com/openai/mle-bench/blob/main/mlebench/grade_helpers.py#L36-L55
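To tell those two cases apart without re-running the grader, the `submission_exists` and `valid_submission` booleans in each report entry can be used. A hypothetical helper (not part of mle-bench) built on the fields shown in the report excerpt above:

```python
# Hypothetical helper (not part of mle-bench): classify why an entry's score
# is null, using the boolean fields present in each grading report entry.
def why_score_is_none(entry: dict) -> str:
    if entry["score"] is not None:
        return "score present"
    if not entry["submission_exists"]:
        return "no submission CSV was produced"
    if not entry["valid_submission"]:
        return "a submission was produced but it was invalid"
    return "unknown; see mlebench/grade.py for the grading logic"
```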