microsoft / semantic-kernel

Integrate cutting-edge LLM technology quickly and easily into your apps
https://aka.ms/semantic-kernel
MIT License

Evaluation framework to evaluate the quality and performance of OpenAI models #2372

Closed caramelmacch closed 9 months ago

caramelmacch commented 1 year ago

FMOps (Foundation Model Ops) is a critical aspect of lifecycle management for the performance and quality of OpenAI-based systems. It consists of several steps to ensure effective system operation and improvement. The evaluation framework plays an important role in the experimentation phase, facilitating rapid experimentation and providing valuable insights.

[image: FMOps lifecycle diagram]

Based on my recent project, I'd like to contribute to SK by providing a sample implementation of the evaluation framework for the NL-to-SQL case. I have already designed and developed an evaluation framework to evaluate the quality and performance of the NL-to-SQL capability of AOAI. The framework provides multiple statistical metrics and evaluation techniques to assess the performance of AOAI models in NL-to-SQL tasks. It consists of query exact match, semantic accuracy (i.e., result-set exact match), query syntax validity, query diff, Levenshtein score, and cosine similarity, which provide valuable insights into the performance and accuracy of the NL-to-SQL system.
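For illustration, two of the metrics mentioned above (query exact match and Levenshtein score) can be sketched roughly as follows. This is a minimal sketch, not the actual framework code; the function names and the whitespace/case normalization are my own assumptions.

```python
def exact_match(expected_sql: str, generated_sql: str) -> bool:
    """Exact match after normalizing case and whitespace (assumed normalization)."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(expected_sql) == norm(generated_sql)

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            # Cost of deletion, insertion, or substitution respectively.
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def levenshtein_score(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

For example, `levenshtein("kitten", "sitting")` is 3, so the corresponding `levenshtein_score` is 1 − 3/7 ≈ 0.57.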

These are the sample result reports.

[image: sample result reports]

FYI, I also implemented a SQL Server connector to check result sets, query validity, and query performance. The framework also queries an embeddings model (text-embedding-ada-002) to get embeddings for cosine similarity.
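Since the embeddings returned by text-embedding-ada-002 are just vectors of floats, the cosine-similarity step reduces to a simple calculation once both embeddings are retrieved. A minimal sketch (my own illustrative code, not the connector itself):

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Degenerate zero vector; treat as no similarity.
        return 0.0
    return dot / (norm_u * norm_v)
```

Identical directions score 1.0 and orthogonal directions score 0.0, which is why it works well as a "semantic closeness" metric for comparing a generated query (or its natural-language intent) against a reference.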

Please let me know if I can move forward with making PRs. I would also appreciate suggestions on where to push it (I think "samples" would be OK).

evchaki commented 1 year ago

@caramelmacch - thanks for thinking about this! You should also check this out - https://devblogs.microsoft.com/semantic-kernel/use-natural-language-to-execute-sql-queries/ to make sure some of it is not duplicated.

caramelmacch commented 1 year ago

@evchaki Thanks for your feedback. This is not mainly about NL-to-SQL but about an evaluation framework that can be used in generic use cases. NL-to-SQL is just one case that explains how to evaluate completions generated by OpenAI. As described above, it consists of loading an evaluation dataset, assessing metrics (using evaluation techniques such as exact match, Levenshtein distance, cosine similarity, and so on), and generating a report. FYI, it will be a Jupyter notebook that uses custom connectors and plugins for calculating the metrics. This sample could be used as a reference implementation in any use case that requires evaluating the quality and performance of the system.
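The load-dataset → assess-metrics → generate-report flow described above could be sketched as a small driver loop like the following. All names here are illustrative assumptions, and the trivial exact-match lambda stands in for the real metric plugins:

```python
def evaluate(dataset: list[dict], metrics: dict) -> dict:
    """Score every row of the dataset with every metric, then average.

    dataset: list of rows like {"expected": ..., "generated": ...}
    metrics: mapping of metric name -> fn(expected, generated) -> float
    """
    report = {name: 0.0 for name in metrics}
    for row in dataset:
        for name, fn in metrics.items():
            report[name] += fn(row["expected"], row["generated"])
    n = max(len(dataset), 1)  # avoid division by zero on an empty dataset
    return {name: total / n for name, total in report.items()}

# Example usage with a placeholder exact-match metric:
dataset = [
    {"expected": "SELECT id FROM users", "generated": "SELECT id FROM users"},
    {"expected": "SELECT name FROM users", "generated": "SELECT * FROM users"},
]
metrics = {"exact_match": lambda e, g: float(e == g)}
print(evaluate(dataset, metrics))  # {'exact_match': 0.5}
```

In the actual notebook the report step would presumably render these aggregates as the tabular reports shown earlier, rather than printing a dict.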

evchaki commented 1 year ago

@caramelmacch - Notebook and connectors sounds great! I am looking forward to seeing this.

matthewbolanos commented 9 months ago

Closing this issue since it was created a few months ago. Ideally, this is supported by Azure AI Studio.