vectara / hallucination-leaderboard

Leaderboard Comparing LLM Performance at Producing Hallucinations when Summarizing Short Documents
https://vectara.com
Apache License 2.0

Hallucination Leaderboard

Public LLM leaderboard computed using Vectara's Hughes Hallucination Evaluation Model. This evaluates how often an LLM introduces hallucinations when summarizing a document. We plan to update this regularly as our model and the LLMs get updated over time.

Also, feel free to check out our hallucination leaderboard on Hugging Face.

In loving memory of Simon Mark Hughes...

Last updated on July 19th, 2024

| Model | Hallucination Rate | Factual Consistency Rate | Answer Rate | Average Summary Length (Words) |
|---|---|---|---|---|
| GPT 4 Turbo | 2.5 % | 97.5 % | 100.0 % | 86.2 |
| Snowflake Arctic | 2.6 % | 97.4 % | 100.0 % | 68.7 |
| Intel Neural Chat 7B | 2.8 % | 97.2 % | 89.5 % | 57.6 |
| GPT 4 | 3.0 % | 97.0 % | 100.0 % | 81.1 |
| GPT 4o mini | 3.1 % | 96.9 % | 100.0 % | 76.3 |
| Microsoft Orca-2-13b | 3.2 % | 96.8 % | 100.0 % | 66.2 |
| GPT 3.5 Turbo | 3.5 % | 96.5 % | 99.6 % | 84.1 |
| GPT 4o | 3.7 % | 96.3 % | 100.0 % | 77.8 |
| Cohere Command R Plus | 3.8 % | 96.2 % | 100.0 % | 71.2 |
| Mixtral 8x22B | 3.8 % | 96.2 % | 99.9 % | 92.0 |
| Cohere Command R | 3.9 % | 96.1 % | 99.9 % | 51.2 |
| Microsoft Phi-3-mini-128k | 4.1 % | 95.9 % | 100.0 % | 60.1 |
| Mistral 7B Instruct-v0.2 | 4.5 % | 95.5 % | 100.0 % | 106.1 |
| Llama 3 70B | 4.5 % | 95.5 % | 99.2 % | 68.5 |
| Google Gemini 1.5 Pro | 4.6 % | 95.4 % | 89.3 % | 82.1 |
| Google Gemini Pro | 4.8 % | 95.2 % | 98.4 % | 89.5 |
| Microsoft WizardLM-2-8x22B | 5.0 % | 95.0 % | 99.9 % | 140.8 |
| Microsoft Phi-3-mini-4k | 5.1 % | 94.9 % | 100.0 % | 86.8 |
| Llama 2 70B | 5.1 % | 94.9 % | 99.9 % | 84.9 |
| Google Gemini 1.5 Flash | 5.3 % | 94.7 % | 98.1 % | 62.8 |
| Llama 3 8B | 5.4 % | 94.6 % | 99.8 % | 79.7 |
| Llama 2 7B | 5.6 % | 94.4 % | 99.6 % | 119.9 |
| Llama 2 13B | 5.9 % | 94.1 % | 99.8 % | 82.1 |
| Anthropic Claude 3 Sonnet | 6.0 % | 94.0 % | 100.0 % | 108.5 |
| Databricks DBRX Instruct | 6.1 % | 93.9 % | 100.0 % | 85.9 |
| Google Gemma-1.1-7b-it | 6.3 % | 93.7 % | 100.0 % | 64.3 |
| Anthropic Claude 3.5 Sonnet | 6.7 % | 93.3 % | 100.0 % | 103.0 |
| Google Gemma-2-9b-it | 7.0 % | 93.0 % | 100.0 % | 70.2 |
| Anthropic Claude 3 Opus | 7.4 % | 92.6 % | 95.5 % | 92.1 |
| Google Gemma-7b-it | 7.5 % | 92.5 % | 100.0 % | 113.0 |
| Cohere-Chat | 7.5 % | 92.5 % | 98.0 % | 74.4 |
| Cohere | 8.5 % | 91.5 % | 99.8 % | 59.8 |
| Anthropic Claude 2 | 8.5 % | 91.5 % | 99.3 % | 87.5 |
| Microsoft Phi 2 | 8.5 % | 91.5 % | 91.5 % | 80.8 |
| Google Palm 2 | 8.6 % | 91.4 % | 99.8 % | 86.6 |
| Mixtral 8x7B | 9.3 % | 90.7 % | 99.9 % | 90.7 |
| Amazon Titan Express | 9.4 % | 90.6 % | 99.5 % | 98.4 |
| Mistral 7B Instruct-v0.1 | 9.4 % | 90.6 % | 98.7 % | 96.1 |
| Google Palm 2 Chat | 10.0 % | 90.0 % | 100.0 % | 66.2 |
| Google Gemma-1.1-2b-it | 11.2 % | 88.8 % | 100.0 % | 66.8 |
| Google flan-t5-large | 15.8 % | 84.2 % | 99.3 % | 20.9 |
| tiiuae falcon-7b-instruct | 16.2 % | 83.8 % | 90.0 % | 75.5 |
| Apple OpenELM-3B-Instruct | 22.4 % | 77.6 % | 99.3 % | 47.2 |

Model

The model used to compute this leaderboard is open sourced for commercial use on Hugging Face and Kaggle, along with instructions on how to use it.
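As a quick illustration, the snippet below scores a (source, summary) pair with the open HHEM model using the cross-encoder interface documented on its Hugging Face model card. This is a minimal sketch; the exact loading code may differ for newer HHEM releases, so defer to the model card if it disagrees.

```python
# Sketch: scoring a summary for factual consistency with Vectara's open HHEM model.
# Interface follows the cross-encoder usage shown on the Hugging Face model card;
# newer HHEM versions may expose a different loading path, so check the card first.
from sentence_transformers import CrossEncoder

model = CrossEncoder("vectara/hallucination_evaluation_model")

source = "The plane landed safely in Denver after a two-hour delay."
summary = "The flight arrived in Denver without incident following a delay."

# predict() returns a score in [0, 1] for each (source, summary) pair:
# values near 1 mean the summary is factually consistent with the source,
# values near 0 indicate a likely hallucination.
score = model.predict([[source, summary]])[0]
print(f"Factual consistency score: {score:.3f}")
```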

Data

See the linked dataset for the generated summaries we used to evaluate the models.
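Once downloaded, the summaries can be inspected locally. The sketch below assumes the file is named leaderboard-summaries.csv (as referenced later in this README); beyond the documented 'source' column, the 'model' and 'summary' column names are assumptions and should be checked against the actual file header.

```python
# Sketch: inspecting the released summaries with pandas.
# The filename and the 'source' column are referenced in this README; the
# 'model' and 'summary' column names below are assumptions, so verify them
# against the real CSV header before relying on this.
import pandas as pd

df = pd.read_csv("leaderboard-summaries.csv")
print(df.columns.tolist())  # confirm the actual column names first

# Example: average summary length in words per model (assumed columns).
df["summary_words"] = df["summary"].str.split().str.len()
print(df.groupby("model")["summary_words"].mean().sort_values())
```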

Prior Research

A great deal of prior work has been done in this area (factual consistency in summarization). For a very comprehensive list of papers, please see https://github.com/EdinburghNLP/awesome-hallucination-detection. The methods described in the following section use protocols established in that body of work, amongst many others.

Methodology

For a detailed explanation of the work that went into this model please refer to our blog post on the release: Cut the Bull…. Detecting Hallucinations in Large Language Models.

To determine this leaderboard, we trained a model to detect hallucinations in LLM outputs, using various open-source datasets from research on factual consistency in summarization models. Using a model that is competitive with the best state-of-the-art models, we then fed 1000 short documents to each of the LLMs above via their public APIs and asked them to summarize each short document, using only the facts presented in the document. Of these 1000 documents, only 831 were summarized by every model; the remaining documents were rejected by at least one model due to content restrictions. Using these 831 documents, we then computed the overall factual consistency rate (no hallucinations) and hallucination rate (100% minus the factual consistency rate) for each model. The rate at which each model refuses to respond to the prompt is detailed in the 'Answer Rate' column. None of the content sent to the models contained illicit or 'not safe for work' material, but the presence of trigger words was enough to trip some of the content filters. The documents were taken primarily from the CNN / Daily Mail Corpus. We used a temperature of 0 when calling the LLMs.
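To make the aggregation concrete, the sketch below shows one way headline numbers like these could be computed from per-summary consistency scores. The 0.5 decision threshold and the data structures are illustrative assumptions, not the exact pipeline used for this leaderboard.

```python
from typing import Dict, List

# Sketch of the aggregation step, under stated assumptions:
#  - results[model] holds HHEM scores for the documents that model agreed to summarize
#  - a score >= 0.5 counts as factually consistent (the threshold is an assumption)
#  - for simplicity, rates are computed over each model's answered documents rather
#    than the common 831-document subset described above
def leaderboard_rows(results: Dict[str, List[float]], num_docs: int = 1000):
    rows = []
    for model, scores in results.items():
        answered = len(scores)
        consistent = sum(s >= 0.5 for s in scores)
        factual_consistency_rate = 100.0 * consistent / answered
        rows.append({
            "model": model,
            "hallucination_rate": round(100.0 - factual_consistency_rate, 1),
            "factual_consistency_rate": round(factual_consistency_rate, 1),
            "answer_rate": round(100.0 * answered / num_docs, 1),
        })
    return sorted(rows, key=lambda row: row["hallucination_rate"])

# Toy example with three documents:
print(leaderboard_rows({"model-a": [0.9, 0.8, 0.2], "model-b": [0.95, 0.6]}, num_docs=3))
```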

We evaluate summarization factual consistency rate instead of overall factual accuracy because it allows us to compare the model's response to the provided information. In other words, is the summary 'factually consistent' with the source document? Determining hallucinations is impossible to do for any ad hoc question, as it is not known precisely what data every LLM is trained on. In addition, having a model that can determine whether any response was hallucinated without a reference source would require solving the hallucination problem itself, and presumably training a model as large as or larger than the LLMs being evaluated. So we instead chose to look at the hallucination rate within the summarization task, as this is a good analogue for how truthful the models are overall. In addition, LLMs are increasingly used in RAG (Retrieval Augmented Generation) pipelines to answer user queries, such as in Bing Chat and Google's chat integration. In a RAG system, the model is deployed as a summarizer of the search results, so this leaderboard is also a good indicator of the accuracy of the models when used in RAG systems.

Prompt Used

You are a chat bot answering questions using data. You must stick to the answers provided solely by the text in the passage provided. You are asked the question 'Provide a concise summary of the following passage, covering the core pieces of information described.' <PASSAGE>'

When calling the API, the <PASSAGE> token was then replaced with the source document (see the 'source' column in leaderboard-summaries.csv).
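As an illustration, the sketch below fills the template and requests a summary at temperature 0 through one provider's SDK. The OpenAI client and the "gpt-4-turbo" model name are illustrative stand-ins; each model on the leaderboard was called through its own public API, as listed in the next section.

```python
# Sketch: filling the prompt template and requesting a summary at temperature 0.
# The OpenAI client and model name are examples only; other providers were
# called through their own SDKs in the same way.
from openai import OpenAI

PROMPT_TEMPLATE = (
    "You are a chat bot answering questions using data. You must stick to the "
    "answers provided solely by the text in the passage provided. You are asked "
    "the question 'Provide a concise summary of the following passage, covering "
    "the core pieces of information described.' <PASSAGE>"
)

def summarize(source_document: str, model: str = "gpt-4-turbo") -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = PROMPT_TEMPLATE.replace("<PASSAGE>", source_document)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic decoding, as used for the leaderboard
    )
    return response.choices[0].message.content
```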

API Integration Details

Below is a detailed overview of the models integrated and their specific endpoints:

OpenAI Models

Llama Models

Cohere Models

For more information about Cohere's models, refer to their website.

Anthropic Model

Mistral AI Models on Hugging Face

Google Palm Models via Vertex AI

For an in-depth understanding of each model's version and lifecycle, especially those offered by Google, please refer to Model Versions and Lifecycles on Vertex AI.

Titan Models on Amazon Bedrock

Microsoft Models

Google Models on Hugging Face

tiiuae Models on Hugging Face

Intel Models on Hugging Face

Databricks Model

Snowflake Model

Apple Model

Frequently Asked Questions

Coming Soon