haesleinhuepf / human-eval-bia

Benchmarking Large Language Models for Bio-Image Analysis Code Generation

Benchmarking o1-preview-2024-09-12 #135

Open · haesleinhuepf opened this issue 2 months ago

haesleinhuepf commented 2 months ago

If anyone knows someone who is tier 5 with OpenAI (@royerloic maybe?), they could benchmark the new o1 model. I am just tier 3 and have to wait...

https://x.com/OpenAI/status/1834278218888872042

https://platform.openai.com/docs/models/o1
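
For anyone who gets access, here is a minimal sketch of a single call against the model, assuming the openai Python client (>= 1.x) and an API key in the environment; the prompt text is a placeholder, not one of the actual benchmark cases:

```python
# Minimal sketch of a single o1-preview call via the openai Python client
# (>= 1.x); assumes OPENAI_API_KEY is set in the environment. The prompt
# below is a placeholder, not an actual benchmark case.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1-preview-2024-09-12",
    # o1-preview accepts only user/assistant messages; system messages and
    # sampling parameters such as temperature are rejected by the API.
    messages=[
        {
            "role": "user",
            "content": "Write a Python function that applies an Otsu "
                       "threshold to a 2D image.",
        }
    ],
    # o1 models use max_completion_tokens (covering visible output plus the
    # hidden reasoning tokens) instead of max_tokens.
    max_completion_tokens=2048,
)

print(response.choices[0].message.content)
```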

jkh1 commented 2 months ago

It seems that this model has hidden reasoning tokens that you still get billed for (see the note in the docs here: https://platform.openai.com/docs/guides/reasoning/how-reasoning-works), which may explain why it's limited to tier 5 users :smiley: This could become an expensive experiment.
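
For what it's worth, the billed reasoning tokens should be visible in the usage field of the response. A sketch, assuming a `response` object from a completed call as in the snippet above; the field names match those documented for the o1 launch but are worth double-checking against the current API reference:

```python
# Sketch: inspect how many hidden reasoning tokens were billed for a call.
# `response` is assumed to come from client.chat.completions.create(...)
# with an o1 model, as in the sketch above.
usage = response.usage
details = usage.completion_tokens_details

print(f"prompt tokens:     {usage.prompt_tokens}")
print(f"completion tokens: {usage.completion_tokens}")   # includes reasoning
print(f"hidden reasoning tokens: {details.reasoning_tokens}")
```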

haesleinhuepf commented 2 months ago

Yeah, I know. I think Claude does something similar. That's why I'm curious whether it's really so much better than the models we have tested so far.

haesleinhuepf commented 2 months ago

Ok, I have access now. Just FYI: my first 12 prompts cost $6.68 with o1-preview, so running the entire benchmark would cost about $300. (updated)
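
For transparency, the extrapolation behind that figure, assuming cost scales linearly with prompt count; the total of roughly 540 prompts is implied by the two quoted numbers, not an official count:

```python
# Back-of-the-envelope extrapolation of the figures above; assumes cost
# scales linearly with prompt count. The total of ~540 prompts is inferred
# from the quoted numbers, not an official count.
cost_first_batch = 6.68      # USD for the first 12 prompts
prompts_first_batch = 12

cost_per_prompt = cost_first_batch / prompts_first_batch   # ~$0.56
total_prompts = 540          # assumed; implied by the ~$300 estimate

print(f"~${cost_per_prompt:.2f} per prompt, "
      f"~${cost_per_prompt * total_prompts:.0f} for the full benchmark")
```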