Breakdown of which benchmarks were solved in paper

openai / mle-bench

MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering

Breakdown of which benchmarks were solved in paper #9

Open SamuelSchmidgall opened 4 days ago

SamuelSchmidgall commented 4 days ago

Hello,

I noticed that the paper does not say exactly which of the benchmark competitions your solutions were able to solve. I am also curious about the percentage breakdown for Low, Medium, and High complexity (e.g. above median, or earning Bronze, Silver, or Gold). I would greatly appreciate it if this data could be provided.

Thank you, Sam

Fardeen786-eng commented 3 days ago

Hi team, along with that analysis, it would help to get a clear classification of the competitions by the types mentioned, together with the medals achieved for each type. I think there were multiple runs and seeds, but results for just the best runs would do.

Regards, Fardeen

thesofakillers commented 2 days ago

we get a clear classification of the competitions based on the types mentioned

Hi, what do you mean exactly by this? Are you looking for the raw data that went into making Figure 6 in the report?

If you're simply looking for which comps are low/medium/high, you can check the splits in experiments/splits/
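As a minimal sketch of how those splits could be turned into a competition-to-tier lookup: the thread does not describe the file format, so the layout below (one plain-text file per tier named `low.txt`, `medium.txt`, `high.txt`, with one competition ID per line) and the helper name `load_tiers` are assumptions, not the repository's documented interface.

```python
from pathlib import Path


def load_tiers(splits_dir, tiers=("low", "medium", "high")):
    """Map each competition ID to its complexity tier.

    Assumption (not confirmed in this thread): `splits_dir` contains one
    `<tier>.txt` file per tier, each listing one competition ID per line.
    """
    tier_of = {}
    for tier in tiers:
        path = Path(splits_dir) / f"{tier}.txt"
        for line in path.read_text().splitlines():
            comp_id = line.strip()
            if comp_id:  # skip blank lines
                tier_of[comp_id] = tier
    return tier_of
```

If the actual files use different names or a different format, only the path construction and the line parsing would need to change.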

SamuelSchmidgall commented 1 day ago

Hi, I'm not sure what the other user meant.

I was hoping to get a medal breakdown based on the difficulty tiers of the challenges. In Table 2 you report the following measures: Made Submission (%), Valid Submission (%), Above Median (%), Bronze (%), Silver (%), Gold (%), Any Medal (%).

However, since you do not report which exact competitions these metrics were earned on, there is no way to know, for example, the Above Median (%) for low-complexity problems. I was hoping to be able to create a table with the same columns:

Made Submission (%), Valid Submission (%), Above Median (%), Bronze (%), Silver (%), Gold (%), Any Medal (%)

but with one row per complexity level (low, medium, high).
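To make the request concrete, here is a rough sketch of how such a table could be computed once per-run results are available. The record layout (a `competition_id` key plus one boolean per metric), the metric key names, and the function `breakdown_by_tier` are all hypothetical; the thread does not say how the raw results are stored.

```python
from collections import defaultdict

# Hypothetical key names mirroring the Table 2 columns.
METRICS = (
    "made_submission", "valid_submission", "above_median",
    "bronze", "silver", "gold", "any_medal",
)


def breakdown_by_tier(results, tier_of):
    """Return {tier: {metric: percent of runs where the metric is true}}.

    `results` is a list of per-run dicts (assumed layout) and `tier_of`
    maps a competition ID to "low", "medium" or "high".
    """
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for run in results:
        tier = tier_of[run["competition_id"]]
        totals[tier] += 1
        for metric in METRICS:
            # A missing metric is counted as not achieved.
            counts[tier][metric] += bool(run.get(metric, False))
    return {
        tier: {m: 100.0 * counts[tier][m] / totals[tier] for m in METRICS}
        for tier in totals
    }
```

Each tier's percentages are taken over the runs in that tier only, so the three rows are directly comparable even though the tiers contain different numbers of competitions.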

Fardeen786-eng commented 1 day ago

we get a clear classification of the competitions based on the types mentioned

Hi, what do you mean exactly by this? Are you looking for the raw data that went into making Figure 6 in the report?

If you're simply looking for which comps are low/medium/high, you can check the splits in experiments/splits/

Yes, I am looking for the raw data behind Figure 6, along with some analysis of the medals earned by the runs in each of the categories shown in Figure 6 (for example, the medal percentage for Tabular, Text Classification, and so on).