scicode-bench / SciCode

A benchmark that challenges language models to code solutions for scientific problems
Apache License 2.0

Request to evaluate the new O1 models by OpenAI (O1-preview and O1-mini) #14

Closed. Belzedar94 closed this issue 1 day ago.

Belzedar94 commented 6 days ago

Request in title. Love your work!! :)

tonysy commented 2 days ago

We have tested o1-mini with OpenCompass, using max_completion_tokens=16k. More results for o1-preview will be posted soon.

SciCode: {'accuracy': 1.5384615384615385, 'sub_accuracy': 24.305555555555557}
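
For reference, a minimal sketch of the setting described above: querying o1-mini with a 16k completion-token cap via the OpenAI Python client. The actual run used OpenCompass (whose config is not shown here), so this standalone call is only illustrative, and the prompt content is a placeholder.

```python
# Illustrative only: a single o1-mini call with the 16k completion-token cap
# mentioned above. Not the OpenCompass harness actually used for the results.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o1-mini",
    messages=[
        # Placeholder prompt; SciCode provides its own problem statements.
        {"role": "user", "content": "Write Python code solving the following subproblem: ..."},
    ],
    # o1-series models take max_completion_tokens rather than max_tokens.
    max_completion_tokens=16000,
)

print(response.choices[0].message.content)
```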
ofirpress commented 1 day ago

we will have official o1 results very soon :)