The VZCode AI Assist feature's prompting system is currently based on trial and error. There are no tests or benchmarks to evaluate the quality of the prompt or to exercise it under various scenarios. We need to introduce some sort of test harness or benchmarking system for the AI aspect of VZCode, so that we can continually evolve the prompt and keep it working well across various AI models.

The key metric would be the percent of tasks for which all tests pass. This is the metric used by the Aider benchmark system, which we can look to for inspiration.
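For concreteness, here is a minimal sketch of that metric in TypeScript. The `TaskResult` shape is hypothetical, just enough to show the calculation:

```ts
// Hypothetical shape for one benchmark task's outcome; the real harness
// would define its own record with more detail (timings, diffs, etc.).
interface TaskResult {
  taskId: string;
  allTestsPassed: boolean;
}

// The headline metric: percent of tasks for which all tests passed.
function passRate(results: TaskResult[]): number {
  if (results.length === 0) return 0;
  const passed = results.filter((r) => r.allTestsPassed).length;
  return (passed / results.length) * 100;
}
```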
I really like how the Aider benchmark runs in Docker. It executes Python as part of the tests, and we could do the same for JavaScript using Node (or Bun, or another runtime).
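A minimal sketch of what that could look like for one JavaScript task, assuming each task lives in its own directory with an `npm test` script; the image name, mount path, and flags are assumptions, not a spec:

```ts
// Sketch: run a task's test suite inside a Docker container with Node.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

async function runTestsInDocker(taskDir: string): Promise<boolean> {
  try {
    await execFileAsync("docker", [
      "run",
      "--rm",
      "--network=none",         // isolate the container from the network
      "-v", `${taskDir}:/task`, // mount the task's code into the container
      "-w", "/task",
      "node:20",                // any Node (or Bun) image would do
      "npm", "test",
    ]);
    return true;  // exit code 0: all tests passed
  } catch {
    return false; // nonzero exit code: at least one test failed
  }
}
```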
So I got the Aider benchmark running locally!
I also found a JavaScript version of the Exercism challenges that Aider uses: https://github.com/exercism/javascript/tree/main/exercises
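Putting those pieces together over the Exercism JavaScript exercises, a benchmark run could look roughly like this. `callModel` is entirely hypothetical, standing in for whatever prompting pipeline VZCode's AI Assist ends up exposing; the helpers are the sketches above:

```ts
// Sketch of the overall harness loop, reusing passRate() and
// runTestsInDocker() from the sketches above. callModel() is a
// hypothetical stand-in for the AI Assist prompting pipeline: it asks
// the model to solve the exercise and writes its edits to disk.
declare function callModel(taskDir: string, systemPrompt: string): Promise<void>;

async function runBenchmark(taskDirs: string[], systemPrompt: string): Promise<void> {
  const results: TaskResult[] = [];
  for (const taskDir of taskDirs) {
    await callModel(taskDir, systemPrompt); // model edits the exercise files
    const allTestsPassed = await runTestsInDocker(taskDir); // run `npm test` in Docker
    results.push({ taskId: taskDir, allTestsPassed });
  }
  console.log(`Pass rate: ${passRate(results).toFixed(1)}%`);
}
```

Scoring each prompt variant this way would let us compare prompts and models on the same footing.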