the-crypt-keeper / can-ai-code

Self-evaluating interview for AI coders
https://huggingface.co/spaces/mike-ravkine/can-ai-code-results
MIT License

Evaluate Open AI GPT4? (or ChatGPT 4) #121

Closed by adamerose 9 months ago

adamerose commented 9 months ago

Is there a reason this leaderboard doesn't include OpenAI's GPT4 model? Can it be added?

the-crypt-keeper commented 9 months ago

gpt-3.5-turbo scores one point shy of 100%, so there is no need to run gpt4 on this suite.

the-crypt-keeper commented 9 months ago

Took a fresh look at this, @adamerose. It was unclear which version of 3.5 I had previously tested, so I ran a clean sweep of all available ChatGPT models this morning:

gpt-3.5-turbo-0301, gpt-3.5-turbo-0613, gpt-3.5-turbo-1106, gpt-4-0613, gpt-4-1106-preview

HF leaderboard has been updated with results:

gpt-3.5-turbo-1106 is the new reference and achieves a perfect 100% score on both suites.

Somewhat unexpectedly, gpt4 sits in the high 90s but seems to have trouble following variable and function naming instructions.
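
For illustration only, here is a minimal sketch of how a naming-instruction failure of this kind could be detected. The helper `check_signature` and the two sample snippets are hypothetical and are not taken from the can-ai-code harness; the sketch assumes the prompt asked for a Python function named `fib` with a single argument `n`.

```python
import ast


def check_signature(source: str, func_name: str, arg_names: list[str]) -> bool:
    """Return True if `source` defines a top-level function called
    `func_name` whose positional argument names match `arg_names` exactly."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Code that does not parse cannot satisfy the naming instruction.
        return False
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            return [a.arg for a in node.args.args] == arg_names
    return False


# Hypothetical model outputs: both compute the same thing,
# but only the first follows the requested names.
good = "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)\n"
renamed = (
    "def fibonacci(num):\n"
    "    return num if num < 2 else fibonacci(num - 1) + fibonacci(num - 2)\n"
)

print(check_signature(good, "fib", ["n"]))
print(check_signature(renamed, "fib", ["n"]))
```

Under this kind of check, a functionally correct answer can still lose points when the model picks its own function or argument names.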