Be able to version control and monitor LLM answers
Be able to see the quality of answers, and refine and improve prompts over time
As an expert user, play with different prompts and see if I can improve the answers
As a developer, run a unit test/regression test to see if the LLM can be deployed
Track scores over time to check for regressions in performance
Short term maintenance approach:
Sort out the Caddy messages vs. Caddy responses DynamoDB table to include the various additional desired tags (i.e. routing, eval scores, received timestamps, etc.)
Add evaluation metrics (as in the KM portal) into the Caddy responses table (item sketch after this list)
Run the evaluation metrics on the current set of 20 Caddy questions and generated answers
Bring the evaluation into CI/CD as a basic unit test (test sketch after this list)
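A minimal sketch of what a Caddy responses item could carry once the extra tags are stored alongside the answer. The table name, key and attribute names (message_id, route, eval_scores, received_at) are assumptions for illustration, not the current schema.

```python
from datetime import datetime, timezone
from decimal import Decimal

import boto3

# Assumed table and attribute names -- the real Caddy schema may differ.
dynamodb = boto3.resource("dynamodb")
responses_table = dynamodb.Table("caddy_responses")


def put_response(message_id: str, response_text: str, route: str, eval_scores: dict) -> None:
    """Store a Caddy response together with the proposed extra tags."""
    responses_table.put_item(
        Item={
            "message_id": message_id,          # partition key (assumed)
            "response_text": response_text,
            "route": route,                    # routing tag
            # DynamoDB needs Decimal rather than float for numeric attributes
            "eval_scores": {k: Decimal(str(v)) for k, v in eval_scores.items()},
            "received_at": datetime.now(timezone.utc).isoformat(),
        }
    )
```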
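And a minimal sketch of the CI/CD check, assuming the 20 questions and their reference answers live in a fixture file and that an eval module exposes generate_answer/score_answer; the module, fixture path and 0.7 threshold are all placeholders, not the real KM-portal metrics.

```python
import json

import pytest

# Hypothetical helpers -- stand-ins for whatever the KM-portal metrics become.
from caddy_eval import generate_answer, score_answer  # assumed module

THRESHOLD = 0.7  # assumed minimum acceptable eval score

# Assumed fixture: list of {"question": ..., "model_answer": ...} for the 20 cases.
with open("tests/fixtures/caddy_questions.json") as f:
    CASES = json.load(f)


@pytest.mark.parametrize("case", CASES)
def test_answer_quality(case):
    answer = generate_answer(case["question"])
    score = score_answer(answer, case["model_answer"])
    assert score >= THRESHOLD, f"Eval score {score:.2f} below {THRESHOLD}"
```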
Long term maintenance approach:
Take all topics / a sample of queries
Use Caddy to generate an answer for each question
Crowdsource to allow advisors/supervisors across LCAs to refine and create 'model' answers
Over time, measure incoming queries against the model queries and look for drift (sketch at the end of this section)
Separate platform for Caddy?
Separate project on expert/crowdsourced management of LLM answers in the public sector
We need to review how we do this broadly.
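For the drift measurement above, a minimal sketch assuming query embeddings are already computed (by whatever embedding model Caddy ends up using). It reports the mean cosine distance from each incoming query to its nearest model query, so a rising score suggests incoming traffic is moving away from the sampled topics. The alert threshold in the usage comment is a placeholder.

```python
import numpy as np


def drift_score(incoming: np.ndarray, model_queries: np.ndarray) -> float:
    """Mean cosine distance from each incoming query embedding to its nearest
    model-query embedding. Rows are queries, columns are embedding dimensions;
    embeddings are assumed to be precomputed."""
    a = incoming / np.linalg.norm(incoming, axis=1, keepdims=True)
    b = model_queries / np.linalg.norm(model_queries, axis=1, keepdims=True)
    similarities = a @ b.T                 # cosine similarity matrix
    nearest = similarities.max(axis=1)     # best match per incoming query
    return float(np.mean(1.0 - nearest))   # 0 = identical traffic, higher = drift

# Example usage (threshold and variable names are assumptions):
# if drift_score(week_embeddings, model_embeddings) > 0.35:
#     flag_for_review()
```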