The VZCode AI Assist feature's prompting system is currently based on trial and error. There are no tests or benchmarks to evaluate the quality of the prompt or to exercise it under various scenarios. We need to introduce some sort of test harness or benchmarking system for the AI aspect of VZCode, so that we can continually evolve the prompt and keep it working well across various AI models.

The key metric would be the percent of tasks for which all tests pass. This is the metric used by the Aider benchmark system, which we can look to for inspiration.
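For concreteness, here is a minimal sketch of that metric in TypeScript. The `TaskResult` shape is hypothetical, just enough to show the calculation:

```ts
// Hypothetical shape for one benchmark task's outcome; the real harness
// would define its own record with more detail (timings, diffs, etc.).
interface TaskResult {
  taskId: string;
  allTestsPassed: boolean;
}

// The headline metric: percent of tasks for which all tests passed.
function passRate(results: TaskResult[]): number {
  if (results.length === 0) return 0;
  const passed = results.filter((r) => r.allTestsPassed).length;
  return (passed / results.length) * 100;
}
```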
I really like how the Aider benchmark runs in Docker. It executes Python as part of the tests, and we could do the same for JavaScript using Node (or Bun, or another runtime).
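A minimal sketch of what that could look like for one JavaScript task, assuming each task lives in its own directory with an `npm test` script; the image name, mount path, and flags are assumptions, not a spec:

```ts
// Sketch: run a task's test suite inside a Docker container with Node.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

async function runTestsInDocker(taskDir: string): Promise<boolean> {
  try {
    await execFileAsync("docker", [
      "run",
      "--rm",
      "--network=none",         // isolate the container from the network
      "-v", `${taskDir}:/task`, // mount the task's code into the container
      "-w", "/task",
      "node:20",                // any Node (or Bun) image would do
      "npm", "test",
    ]);
    return true;  // exit code 0: all tests passed
  } catch {
    return false; // nonzero exit code: at least one test failed
  }
}
```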
So I got the Aider benchmark running locally!
I also found a JavaScript version of the Exercism challenges that Aider uses: https://github.com/exercism/javascript/tree/main/exercises
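Putting those pieces together over the Exercism JavaScript exercises, a benchmark run could look roughly like this. `callModel` is entirely hypothetical, standing in for whatever prompting pipeline VZCode's AI Assist ends up exposing; the helpers are the sketches above:

```ts
// Sketch of the overall harness loop, reusing passRate() and
// runTestsInDocker() from the sketches above. callModel() is a
// hypothetical stand-in for the AI Assist prompting pipeline: it asks
// the model to solve the exercise and writes its edits to disk.
declare function callModel(taskDir: string, systemPrompt: string): Promise<void>;

async function runBenchmark(taskDirs: string[], systemPrompt: string): Promise<void> {
  const results: TaskResult[] = [];
  for (const taskDir of taskDirs) {
    await callModel(taskDir, systemPrompt); // model edits the exercise files
    const allTestsPassed = await runTestsInDocker(taskDir); // run `npm test` in Docker
    results.push({ taskId: taskDir, allTestsPassed });
  }
  console.log(`Pass rate: ${passRate(results).toFixed(1)}%`);
}
```

Scoring each prompt variant this way would let us compare prompts and models on the same footing.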