llm-edge / hal-9100

Edge full-stack LLM platform, written in Rust.

benchmark openai assistants vs open source assistants #16

Open louis030195 opened 8 months ago

louis030195 commented 8 months ago

End goal would be to have something like this:

OpenAI Assistants API Benchmark

| Model Name | Code Interpreter | Retrieval | Function Calling | JSON Mode | Tool Switching | Speed |
|---|---|---|---|---|---|---|
| GPT-4 | 5 | 5 | 5 | 5 | 5 | 5 |
| GPT-3.5 | 4 | 4 | 4 | 4 | 4 | 4 |

Open Source Assistants API Benchmark

| Model Name | Code Interpreter | Retrieval | Function Calling | JSON Mode | Tool Switching | Speed |
|---|---|---|---|---|---|---|
| Mistral 7B | 5 | 5 | 5 | 5 | 5 | 5 |
| LLaMA 2 | 3 | 3 | 3 | 3 | 3 | 4 |
| LLaVA | 4 | 4 | 4 | 4 | 4 | 4 |
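
For reference, a minimal sketch of how one of these rubric rows could be represented and rendered as a markdown table. The `AssistantScore` struct and its field names are assumptions for illustration, not types from this repo, and the 1-5 scale is inferred from the tables above:

```rust
// Hypothetical representation of one benchmark row (1-5 rubric scores).
struct AssistantScore {
    model_name: String,
    code_interpreter: u8,
    retrieval: u8,
    function_calling: u8,
    json_mode: u8,
    tool_switching: u8,
    speed: u8,
}

fn main() {
    let rows = vec![AssistantScore {
        model_name: "Mistral 7B".into(),
        code_interpreter: 5,
        retrieval: 5,
        function_calling: 5,
        json_mode: 5,
        tool_switching: 5,
        speed: 5,
    }];
    // Print a markdown table like the ones above.
    println!("| Model Name | Code Interpreter | Retrieval | Function Calling | JSON Mode | Tool Switching | Speed |");
    println!("|---|---|---|---|---|---|---|");
    for r in &rows {
        println!(
            "| {} | {} | {} | {} | {} | {} | {} |",
            r.model_name,
            r.code_interpreter,
            r.retrieval,
            r.function_calling,
            r.json_mode,
            r.tool_switching,
            r.speed
        );
    }
}
```
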
louis030195 commented 8 months ago

next:

louis030195 commented 8 months ago

Still to finish: https://gist.github.com/louis030195/3a937de928c553a0c6d9be3d92766c55

louis030195 commented 8 months ago

also fix the result writer, which writes duplicate files
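
One way to stop the duplicates, as a minimal sketch: track which paths were already written during a run and skip repeats. `ResultWriter` and `write_result` are hypothetical names here, not the actual writer in this repo:

```rust
use std::collections::HashSet;
use std::fs;

/// Hypothetical result writer that writes each path at most once per run.
struct ResultWriter {
    written: HashSet<String>,
}

impl ResultWriter {
    fn new() -> Self {
        Self { written: HashSet::new() }
    }

    /// `HashSet::insert` returns false when the path was already recorded,
    /// so a second write to the same path is silently skipped.
    fn write_result(&mut self, path: &str, contents: &str) -> std::io::Result<()> {
        if !self.written.insert(path.to_string()) {
            return Ok(());
        }
        fs::write(path, contents)
    }
}

fn main() -> std::io::Result<()> {
    let mut writer = ResultWriter::new();
    writer.write_result("results_gpt4.json", "{\"score\": 5}")?;
    // Second call with the same path is deduplicated instead of rewritten.
    writer.write_result("results_gpt4.json", "{\"score\": 5}")?;
    Ok(())
}
```
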

louis030195 commented 8 months ago

Another idea is to just write a bunch of rows with "input", "output", and "expected" columns and use best-practice LLM scoring:

https://github.com/openai/evals

Since assistants are basically software 3.0 (foundation models) plus software 1.0 hacks and plumbing, we might also want a column for the extra context the LLM received, or something like this.
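
A minimal sketch of what such a row and a naive scorer could look like, with an extra `context` field for whatever the software-1.0 plumbing injected. All names are illustrative assumptions, and a real setup would use model-graded scoring in the openai/evals style rather than exact match:

```rust
/// One hypothetical benchmark row: input, output, expected, plus context.
struct EvalRow {
    input: String,
    output: String,   // what the assistant actually returned
    expected: String, // the reference answer
    context: String,  // extra context injected by the software-1.0 plumbing
}

/// Naive exact-match scoring; openai/evals-style model-graded scoring
/// would replace this with a call to a judge model.
fn score(row: &EvalRow) -> f32 {
    if row.output.trim() == row.expected.trim() { 1.0 } else { 0.0 }
}

fn main() {
    let rows = vec![EvalRow {
        input: "What is 2 + 2?".into(),
        output: "4".into(),
        expected: "4".into(),
        context: "retrieved: basic arithmetic facts".into(),
    }];
    for r in &rows {
        println!("input: {} | context: {} | score: {}", r.input, r.context, score(r));
    }
    let total: f32 = rows.iter().map(score).sum();
    println!("accuracy: {:.2}", total / rows.len() as f32);
}
```
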

If anyone has ideas on how to apply LLM benchmarking best practices to this project 🙏