mlopscommunity / open-questions-ai-quality

QA for non-deterministic GenAI #13

Open adamboazbecker opened 1 month ago

adamboazbecker commented 1 month ago

How should the team structure surrounding QA change in reaction to the non-deterministic nature of GenAI?

y27choi commented 1 month ago

Evaluating a GenAI system is not as straightforward as evaluating a traditional AI system, where deterministic automatic metrics are used to measure performance. Teams that build a product on a GenAI system mostly rely on human evaluation as their main evaluation strategy; therefore, I'm guessing the team structure surrounding QA will change accordingly.

andrew-lastmile commented 4 weeks ago

+1 to what @y27choi mentioned. Because of the non-deterministic nature of GenAI, human-in-the-loop is a general approach for applying human intuition to evaluating the quality of the output.

Another approach is feeding the results back into an LLM. Although this is a popular approach, my intuition is that a better approach may emerge in the near future.

aransbotham commented 3 weeks ago

I wrote a blog post about how my team at Doximity has begun approaching this.

Ultimately, GenAI-fueled processes and applications need a human-in-the-loop feedback mechanism in addition to various online/offline system evaluation metrics. These mechanisms often include (but aren't limited to):

AI-assisted evaluation metrics are powerful, but they are subject to the same non-deterministic limitations as the GenAI process/application itself. They require similar QA, using human annotations to verify the reliability of the evaluation model.
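One way to picture "using human annotations to verify the evaluation model" is to compare the labels an LLM judge assigns against human labels on the same outputs. The sketch below is only an illustration, not the approach from the blog post: the `pass`/`fail` labels, the sample data, and the `cohen_kappa` helper are all hypothetical, and Cohen's kappa is just one commonly used agreement statistic.

```python
from collections import Counter


def cohen_kappa(human_labels, judge_labels):
    """Chance-corrected agreement between two label sequences.

    Hypothetical helper: `human_labels` are annotator verdicts and
    `judge_labels` are verdicts from an AI-assisted evaluator on the
    same outputs, in the same order.
    """
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and equal in length")

    n = len(human_labels)

    # Fraction of items on which the two raters gave the same label.
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    human_counts = Counter(human_labels)
    judge_counts = Counter(judge_labels)
    labels = set(human_labels) | set(judge_labels)
    expected = sum(human_counts[l] * judge_counts[l] for l in labels) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up example data: verdicts on eight generated answers.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]

kappa = cohen_kappa(human, judge)
print(f"judge vs. human kappa: {kappa:.2f}")
```

A low score on a held-out, human-annotated sample would suggest the judge itself needs attention before its metrics are trusted; what counts as "low" is a team decision, not something this sketch settles.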

I'm with you @andrew-lastmile re: better approaches becoming available in the near future -- GenAI evaluation frameworks are evolving rapidly.

In terms of team structure, ensuring you have a diverse team with developers, data professionals and subject matter experts is key. I wouldn't say that this is a change to team structure per se, but rather an evolution of the roles that each of these functions plays in the QA of GenAI applications vs. deterministic AI systems.