How should the team structure surrounding QA change in reaction to the non-deterministic nature of GenAI?

Open adamboazbecker opened 1 month ago
The evaluation of a GenAI system is not as straightforward as the evaluation of a traditional AI system, where deterministic automatic metrics measure performance. Teams building products on top of GenAI systems mostly rely on human evaluation as their main evaluation strategy; therefore, I'm guessing the team structure surrounding QA will change accordingly.
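To make that contrast concrete, here's a minimal sketch (illustrative data, not from any real system): a deterministic metric like accuracy produces the same score on every run, while exact-match scoring breaks down for free-form GenAI output.

```python
# Deterministic evaluation: the same predictions always produce the same score.
def accuracy(predictions, labels):
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

print(accuracy(["cat", "dog", "cat"], ["cat", "dog", "dog"]))  # 0.666..., every run

# GenAI evaluation: the same prompt can yield different, equally valid outputs,
# so exact-match scoring is brittle and human judgment becomes necessary.
reference = "Paris"
responses = [
    "The capital of France is Paris.",
    "Paris is France's capital city.",
]
print([r == reference for r in responses])  # [False, False], though both are correct
```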
+1 to what @y27choi mentioned. Because of the non-deterministic nature of GenAI, human-in-the-loop is the general approach: applying human intuition to evaluating the quality of the output.
Another approach is feeding the results back into an LLM that acts as a judge. Although this is a popular approach, my intuition is that a better one may emerge in the short-term future.
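For concreteness, here's a minimal LLM-as-judge sketch. `call_llm` is a hypothetical stand-in for whatever provider SDK you actually use, and the prompt and 1-5 scale are illustrative assumptions, not a prescribed setup.

```python
JUDGE_PROMPT = """You are grading a model answer.
Question: {question}
Answer: {answer}
Reply with a single integer score from 1 (poor) to 5 (excellent)."""

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: replace with a real call to your LLM provider.
    # Returns a canned reply here so the sketch runs end to end.
    return "4"

def judge(question: str, answer: str) -> int:
    """Score one answer with the judge model; -1 signals an unparseable reply."""
    reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    try:
        return int(reply.strip())
    except ValueError:
        # The judge is itself non-deterministic, so malformed replies happen.
        return -1

print(judge("What is the capital of France?", "Paris."))  # 4 with the canned stub
```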
I wrote a blog post about how my team at Doximity has begun approaching this.
Ultimately, GenAI-fueled processes and applications need a human-in-the-loop feedback mechanism in addition to various online/offline system evaluation metrics. These mechanisms often include (but aren't limited to) AI-assisted evaluation metrics, which are powerful but subject to the same non-deterministic limitations as the GenAI process/application itself, and which therefore require similar QA, using human annotations to verify the reliability of the evaluation model.
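As one way to picture that verification step, here's a small sketch comparing judge scores against human annotations on the same outputs (the ratings are invented, illustrative data):

```python
# Spot-checking an LLM judge against human annotations on the same outputs.
# Both lists are 1-5 ratings; the values are invented for illustration.
human_scores = [5, 4, 2, 5, 3, 1, 4, 4]
judge_scores = [5, 4, 3, 5, 3, 2, 4, 5]

pairs = list(zip(human_scores, judge_scores))
exact = sum(h == j for h, j in pairs) / len(pairs)
within_one = sum(abs(h - j) <= 1 for h, j in pairs) / len(pairs)

print(f"exact agreement:    {exact:.0%}")       # 62%
print(f"agreement within 1: {within_one:.0%}")  # 100%
```

If agreement drops below whatever threshold the team trusts, the judge prompt or model needs rework before its scores can stand in for human review.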
I'm with you @andrew-lastmile re: better approaches becoming available in the short-term future -- GenAI evaluation frameworks are evolving rapidly.
In terms of team structure, ensuring you have a diverse team of developers, data professionals, and subject matter experts is key. I wouldn't say this is a change to team structure per se, but rather an evolution of the roles each of these functions plays in the QA of GenAI applications vs. deterministic AI systems.