louis030195 opened 8 months ago
Next steps:

- fix the result writer, which writes duplicate files (see the dedup sketch below)
- another idea is to just write a bunch of rows with "input", "output", and "expected" columns and use best-practice LLM scoring: https://github.com/openai/evals
- since Assistants are basically software 3.0 (foundation models) plus software 1.0 hacks and plumbing, we might also want a column for the extra context the LLM received, or something like this (sketched after this list)
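On the duplicate-files bug, one common fix is to make the writer idempotent: derive the output filename from a hash of the result content, so re-running the same case skips or overwrites instead of producing a second copy. This is only a hedged sketch, `write_result` and the `results/` layout are hypothetical and not the project's actual writer:

```python
import hashlib
import json
from pathlib import Path

def write_result(result: dict, out_dir: str = "results") -> Path:
    """Write one eval result, keyed by a content hash so reruns don't duplicate."""
    payload = json.dumps(result, sort_keys=True)  # canonical form for hashing
    digest = hashlib.sha256(payload.encode()).hexdigest()[:16]
    path = Path(out_dir) / f"{digest}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():  # identical result already on disk: skip the write
        path.write_text(payload)
    return path
```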
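And here is a minimal sketch of what one of those rows plus model-graded scoring could look like, loosely following the LLM-as-judge pattern from openai/evals. The field names (`input`, `output`, `expected`, `extra_context`), the grading prompt, and the `ask_llm` callable are all illustrative assumptions, not an existing schema or API:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class EvalRow:
    input: str          # what the user asked the assistant
    output: str         # what the assistant actually answered
    expected: str       # reference answer
    extra_context: str  # the software 1.0 plumbing: retrieved docs, tool outputs, etc.

GRADING_PROMPT = """You are grading an AI assistant's answer.
Question: {input}
Extra context given to the assistant: {extra_context}
Expected answer: {expected}
Actual answer: {output}
Reply with a single word: CORRECT or INCORRECT."""

def score_row(row: EvalRow, ask_llm) -> bool:
    """ask_llm is any callable that sends a prompt to a grader model and returns text."""
    verdict = ask_llm(GRADING_PROMPT.format(**asdict(row)))
    return verdict.strip().upper().startswith("CORRECT")

def run_eval(path: str, ask_llm) -> float:
    """Read a JSONL file of eval rows and return accuracy across the benchmark."""
    with open(path) as f:
        rows = [EvalRow(**json.loads(line)) for line in f]
    correct = sum(score_row(r, ask_llm) for r in rows)
    return correct / len(rows)
```

Keeping `extra_context` as its own column means the grader can distinguish "the model answered badly" from "the plumbing fed it the wrong context", which is most of what differs between assistant implementations.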
If anyone has ideas on how to apply LLM benchmarking best practices to this project, please share 🙏
The end goal would be to have something like this:
[images: "OpenAI Assistants API Benchmark" and "Open Source Assistants API Benchmark"]