DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
Log model responses directly to file and reuse them for debugging #181
Open
bauersimon opened 2 weeks ago
Goal: be able to reuse the exact responses (1:1) from a previous run to debug the evaluation logic.
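
A minimal sketch of the record/replay idea in Go, under the assumption that model queries go through a single function. All names here (`QueryFunc`, `WithRecording`, `ReplayFromLog`, the JSONL log format) are hypothetical illustrations, not DevQualityEval's actual API: one wrapper appends every response to a log file, and a replay function later serves those recorded responses in order so the evaluation logic sees identical inputs.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// QueryFunc is a hypothetical stand-in for whatever function sends a
// prompt to a model and returns its raw response.
type QueryFunc func(prompt string) (string, error)

// loggedResponse is one recorded prompt/response pair, stored as a
// JSON line so a run's responses can be replayed 1:1 later.
type loggedResponse struct {
	Prompt   string `json:"prompt"`
	Response string `json:"response"`
}

// WithRecording wraps a query function so every successful response is
// appended to the given log file before being returned to the caller.
func WithRecording(query QueryFunc, logPath string) QueryFunc {
	return func(prompt string) (string, error) {
		response, err := query(prompt)
		if err != nil {
			return "", err
		}

		file, err := os.OpenFile(logPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
		if err != nil {
			return "", err
		}
		defer file.Close()

		// json.Encoder terminates each value with a newline, giving JSONL.
		if err := json.NewEncoder(file).Encode(loggedResponse{Prompt: prompt, Response: response}); err != nil {
			return "", err
		}

		return response, nil
	}
}

// ReplayFromLog returns a query function that never contacts a model:
// it serves the recorded responses in their original order, so a
// debugging run reproduces the previous run exactly.
func ReplayFromLog(logPath string) (QueryFunc, error) {
	file, err := os.Open(logPath)
	if err != nil {
		return nil, err
	}
	defer file.Close()

	var recorded []loggedResponse
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		var entry loggedResponse
		if err := json.Unmarshal(scanner.Bytes(), &entry); err != nil {
			return nil, err
		}
		recorded = append(recorded, entry)
	}
	if err := scanner.Err(); err != nil {
		return nil, err
	}

	index := 0
	return func(prompt string) (string, error) {
		if index >= len(recorded) {
			return "", fmt.Errorf("no recorded response left for prompt %q", prompt)
		}
		entry := recorded[index]
		index++
		return entry.Response, nil
	}, nil
}

func main() {
	// Fake "model" for demonstration: echoes the prompt back.
	live := func(prompt string) (string, error) { return "echo: " + prompt, nil }

	recording := WithRecording(live, "responses.jsonl")
	if _, err := recording("write a sort function"); err != nil {
		panic(err)
	}

	replay, err := ReplayFromLog("responses.jsonl")
	if err != nil {
		panic(err)
	}
	response, _ := replay("write a sort function")
	fmt.Println(response) // Identical to the recorded run's response.
}
```

One design note on this sketch: replaying by position rather than by prompt keeps the mechanism trivial and guarantees 1:1 reuse, but it assumes the debug run issues queries in the same order as the original run; keying the log by prompt would relax that at the cost of a lookup structure.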