louis030195 opened 8 months ago
Next steps:

- fix the result writer, which writes duplicate files (see the dedup sketch below)
- another idea is to just write a bunch of rows with "input", "output", and "expected" columns and use best-practice LLM scoring: https://github.com/openai/evals
- since Assistants are basically software 3.0 (foundation models) plus software 1.0 hacks and plumbing, we might also want a column for the extra context the LLM received, or something like this (sketched after this list)
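On the duplicate-files bug, one common fix is to make the writer idempotent: derive the output filename from a hash of the result content, so re-running the same case skips or overwrites instead of producing a second copy. This is only a hedged sketch, `write_result` and the `results/` layout are hypothetical and not the project's actual writer:

```python
import hashlib
import json
from pathlib import Path

def write_result(result: dict, out_dir: str = "results") -> Path:
    """Write one eval result, keyed by a content hash so reruns don't duplicate."""
    payload = json.dumps(result, sort_keys=True)  # canonical form for hashing
    digest = hashlib.sha256(payload.encode()).hexdigest()[:16]
    path = Path(out_dir) / f"{digest}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():  # identical result already on disk: skip the write
        path.write_text(payload)
    return path
```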
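And here is a minimal sketch of what one of those rows plus model-graded scoring could look like, loosely following the LLM-as-judge pattern from openai/evals. The field names (`input`, `output`, `expected`, `extra_context`), the grading prompt, and the `ask_llm` callable are all illustrative assumptions, not an existing schema or API:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class EvalRow:
    input: str          # what the user asked the assistant
    output: str         # what the assistant actually answered
    expected: str       # reference answer
    extra_context: str  # the software 1.0 plumbing: retrieved docs, tool outputs, etc.

GRADING_PROMPT = """You are grading an AI assistant's answer.
Question: {input}
Extra context given to the assistant: {extra_context}
Expected answer: {expected}
Actual answer: {output}
Reply with a single word: CORRECT or INCORRECT."""

def score_row(row: EvalRow, ask_llm) -> bool:
    """ask_llm is any callable that sends a prompt to a grader model and returns text."""
    verdict = ask_llm(GRADING_PROMPT.format(**asdict(row)))
    return verdict.strip().upper().startswith("CORRECT")

def run_eval(path: str, ask_llm) -> float:
    """Read a JSONL file of eval rows and return accuracy across the benchmark."""
    with open(path) as f:
        rows = [EvalRow(**json.loads(line)) for line in f]
    correct = sum(score_row(r, ask_llm) for r in rows)
    return correct / len(rows)
```

Keeping `extra_context` as its own column means the grader can distinguish "the model answered badly" from "the plumbing fed it the wrong context", which is most of what differs between assistant implementations.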
If anyone has ideas on how to apply LLM benchmarking best practices to this project, please share 🙏
The end goal would be to have something like this:
[images: "OpenAI Assistants API Benchmark" and "Open Source Assistants API Benchmark"]