How should the team structure surrounding QA change in reaction to the non-deterministic nature of GenAI?

Open adamboazbecker opened 1 month ago
The evaluation of a GenAI system is not as straightforward as the evaluation of a traditional AI system, where deterministic automatic metrics measure performance. Teams building products on top of GenAI systems mostly rely on human evaluation as their main evaluation strategy; therefore, I'm guessing the team structure surrounding QA will change accordingly.
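To make that contrast concrete, here's a minimal sketch (illustrative data, not from any real system): a deterministic metric like accuracy produces the same score on every run, while exact-match scoring breaks down for free-form GenAI output.

```python
# Deterministic evaluation: the same predictions always produce the same score.
def accuracy(predictions, labels):
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

print(accuracy(["cat", "dog", "cat"], ["cat", "dog", "dog"]))  # 0.666..., every run

# GenAI evaluation: the same prompt can yield different, equally valid outputs,
# so exact-match scoring is brittle and human judgment becomes necessary.
reference = "Paris"
responses = [
    "The capital of France is Paris.",
    "Paris is France's capital city.",
]
print([r == reference for r in responses])  # [False, False], though both are correct
```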
+1 to what @y27choi mentioned. Because of the non-deterministic nature of GenAI, human-in-the-loop is the general approach: applying human intuition to evaluating the quality of the output.
Another approach is feeding the results back into an LLM that acts as a judge. Although this is a popular approach, my intuition is that a better one may emerge in the short-term future.
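For concreteness, here's a minimal LLM-as-judge sketch. `call_llm` is a hypothetical stand-in for whatever provider SDK you actually use, and the prompt and 1-5 scale are illustrative assumptions, not a prescribed setup.

```python
JUDGE_PROMPT = """You are grading a model answer.
Question: {question}
Answer: {answer}
Reply with a single integer score from 1 (poor) to 5 (excellent)."""

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: replace with a real call to your LLM provider.
    # Returns a canned reply here so the sketch runs end to end.
    return "4"

def judge(question: str, answer: str) -> int:
    """Score one answer with the judge model; -1 signals an unparseable reply."""
    reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    try:
        return int(reply.strip())
    except ValueError:
        # The judge is itself non-deterministic, so malformed replies happen.
        return -1

print(judge("What is the capital of France?", "Paris."))  # 4 with the canned stub
```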
I wrote a blog post about how my team at Doximity has begun approaching this.
Ultimately, GenAI-fueled processes and applications need a human-in-the-loop feedback mechanism in addition to various online/offline system evaluation metrics. These mechanisms often include (but aren't limited to) AI-assisted evaluation metrics, which are powerful but subject to the same non-deterministic limitations as the GenAI process/application itself, and which therefore require similar QA, using human annotations to verify the reliability of the evaluation model.
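As one way to picture that verification step, here's a small sketch comparing judge scores against human annotations on the same outputs (the ratings are invented, illustrative data):

```python
# Spot-checking an LLM judge against human annotations on the same outputs.
# Both lists are 1-5 ratings; the values are invented for illustration.
human_scores = [5, 4, 2, 5, 3, 1, 4, 4]
judge_scores = [5, 4, 3, 5, 3, 2, 4, 5]

pairs = list(zip(human_scores, judge_scores))
exact = sum(h == j for h, j in pairs) / len(pairs)
within_one = sum(abs(h - j) <= 1 for h, j in pairs) / len(pairs)

print(f"exact agreement:    {exact:.0%}")       # 62%
print(f"agreement within 1: {within_one:.0%}")  # 100%
```

If agreement drops below whatever threshold the team trusts, the judge prompt or model needs rework before its scores can stand in for human review.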
I'm with you @andrew-lastmile re: better approaches becoming available in the short-term future -- GenAI evaluation frameworks are evolving rapidly.
In terms of team structure, ensuring you have a diverse team of developers, data professionals, and subject matter experts is key. I wouldn't say this is a change to team structure per se, but rather an evolution of the roles each of these functions plays in the QA of GenAI applications vs. deterministic AI systems.