Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in Holistic Evaluation of Text-to-Image Models (HEIM) (https://arxiv.org/abs/2311.04287).
Created an AzureOpenAIClient class, which enables us to run HELM against OpenAI models hosted in Azure
Tested against the llm-benchmarking Azure OpenAI deployment (GPT-3.5-turbo)
Benchmarked the mc-defence-qa scenario against the Azure-hosted model to verify that the client works (full dataset, 238 samples)
A number of TODO items remain within the class to address before this work is complete
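The work summarised above can be sketched as follows. This is a hypothetical outline (names such as AzureConfig, load_azure_config, and AzureOpenAIClientSketch are illustrative, not the actual class from the PR), assuming the client wraps the `AzureOpenAI` class from the `openai` Python SDK (v1+) and reads the three environment variables listed in the how-to below:

```python
import os
from dataclasses import dataclass


@dataclass
class AzureConfig:
    """Holds the three settings the client needs (illustrative helper)."""
    api_key: str
    endpoint: str
    deployment: str


def load_azure_config() -> AzureConfig:
    """Read the environment variables the client expects; raises KeyError if unset."""
    return AzureConfig(
        api_key=os.environ["AZURE_OPENAI_KEY"],
        endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        deployment=os.environ["AZURE_DEPLOYMENT_NAME"],
    )


class AzureOpenAIClientSketch:
    """Sketch of a client that routes HELM requests to an Azure OpenAI deployment."""

    def __init__(self, config: AzureConfig, api_version: str = "2023-05-15"):
        # Imported lazily so the module loads without the `openai` package installed.
        from openai import AzureOpenAI

        self._deployment = config.deployment
        self._client = AzureOpenAI(
            api_key=config.api_key,
            azure_endpoint=config.endpoint,
            api_version=api_version,
        )

    def complete(self, prompt: str, max_tokens: int = 100) -> str:
        # Azure routes the request by deployment name, passed via `model`.
        response = self._client.chat.completions.create(
            model=self._deployment,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        return response.choices[0].message.content
```

The real class additionally has to map HELM's internal request/response types onto these calls; that mapping is where the remaining TODO items live.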
How to use the AzureOpenAIClient to run HELM benchmarks:
Add model_deployments.yaml and model_metadata.yaml files to the prod_env/ directory (see the repo on the Azure VM for an example)
Export the following environment variables: AZURE_OPENAI_KEY, AZURE_OPENAI_ENDPOINT, AZURE_DEPLOYMENT_NAME
Run helm-run, ensuring that the model specified in run_entries.conf matches the model name specified in model_deployments.yaml (azure/gpt-35-turbo-0301 in my tests so far)
This will run the benchmarks as normal, using an OpenAI model hosted in Azure.
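The steps above can be sketched end to end as below. The YAML schema is approximated from HELM's documentation, and the class path, suite name, and flag values are illustrative assumptions, not the exact config from the repo on the Azure VM:

```shell
# 1. Minimal prod_env/model_deployments.yaml (schema approximated; adjust to
#    match the example in the repo on the Azure VM):
mkdir -p prod_env
cat > prod_env/model_deployments.yaml <<'EOF'
model_deployments:
  - name: azure/gpt-35-turbo-0301
    model_name: openai/gpt-3.5-turbo-0301
    tokenizer_name: openai/cl100k_base
    max_sequence_length: 4096
    client_spec:
      class_name: "helm.clients.azure_openai_client.AzureOpenAIClient"  # assumed module path
EOF

# 2. Credentials for the Azure deployment:
export AZURE_OPENAI_KEY="<your-key>"
export AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com"
export AZURE_DEPLOYMENT_NAME="gpt-35-turbo-0301"

# 3. Run HELM; the model in run_entries.conf must match the deployment
#    name above (azure/gpt-35-turbo-0301):
helm-run --conf-paths run_entries.conf --suite azure-test --max-eval-instances 238
```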