potsawee / selfcheckgpt

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

Possible annotation errors? #26

Open pramitchoudhary opened 8 months ago

pramitchoudhary commented 8 months ago

Hi there, awesome stuff, thanks for sharing.

Question: Are there possible annotation errors in the eval dataset (wiki_bio_gpt3_hallucination)?

Example: observing example 6 (0-indexed idx = 5), sentence index 11:

array(['Akila Dananjaya (born 2 August 1995) is a Sri Lankan cricketer.',
       'He made his international debut for the Sri Lankan cricket team in August 2018.',
       'He is a right-arm off-spinner and right-handed batsman.',
       'Dananjaya made his first-class debut for Sri Lanka Army Sports Club in the 2013–14 Premier League Tournament.',
       'He was the leading wicket-taker in the tournament, taking 32 wickets in seven matches.',
       'He made his List A debut for Sri Lanka Army Sports Club in the 2014–15 Premier Limited Overs Tournament.',
       'In August 2018, he was named in the Sri Lankan squad for the 2018 Asia Cup.',
       'He made his One Day International (ODI) debut for Sri Lanka against Bangladesh on 15 September 2018.',
       "In October 2018, he was named in Sri Lanka's Test squad for their series against England, but he did not play.",
       "In December 2018, he was named in Sri Lanka's team for the 2018 ACC Emerging Teams Asia Cup.",
       'He was the leading wicket-taker for Sri Lanka in the tournament, with nine dismiss'],
      dtype=object)

index = 11

He was the leading wicket-taker for Sri Lanka in the tournament, with nine dismiss

Ground truth: major_inaccurate. Should it be accurate?

Observation: in a brief analysis, only the NLI scores aggregated at the sentence level seem to agree with the ground truth. The code is provided below.

import spacy
import torch
from datasets import load_dataset
from selfcheckgpt.modeling_selfcheck import SelfCheckNLI

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_nli = SelfCheckNLI(device=device)  # used below; was undefined in the original snippet

dataset = load_dataset("potsawee/wiki_bio_gpt3_hallucination")
eval_df = dataset['evaluation'].to_pandas()
eval_df.head()
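# columns in this dataset (as I read the dataset card): gpt3_text, wiki_bio_text,
# gpt3_sentences, annotation -- one row per GPT-3-generated passage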

# selecting example 6 (idx = 5)
_idx = 5
sentences = eval_df['gpt3_sentences'][_idx]
context = eval_df['gpt3_text'][_idx]

gt_labels = eval_df['annotation'][_idx]
print(f"labels:\n{gt_labels}")

print(f"Label for sentence index 11: {gt_labels[10]}")

print("Context:\n")
print(context)
print("\n\n")

print(f"**Eval sentence:**\n {sentences[10]}")
print("\n\n")

# Context broken into sentences
nlp = spacy.load("en_core_web_md")
context_samples = [sent.text.strip() for sent in nlp(context).sents]

# Passage:
nli_scores = selfcheck_nli.predict(
    sentences = [sentences[10]],                       
    sampled_passages = [context]
)
print(f"Passage (low score as expected): {nli_scores}")

nli_scores = selfcheck_nli.predict(
    sentences = [sentences[10]],                       
    sampled_passages = context_samples
)
print(f"Avg score on sentences is high: {nli_scores}")

specific_exp = """
In December 2018, he was named in Sri Lanka's team for the 2018 ACC Emerging Teams Asia Cup. He was the leading wicket-taker for Sri Lanka in the tournament, with nine dismiss
"""
nli_scores_specific = selfcheck_nli.predict(
    sentences = [sentences[10]],                       
    sampled_passages = [specific_exp]
)

print(f"Does it even work - comparing with the matched string (low score as expected)? : {nli_scores_specific}")

print("---- Using LLM - mistralai/Mistral-7B-Instruct-v0.2 ----") # quick eval purpose
from selfcheckgpt.modeling_selfcheck import SelfCheckLLMPrompt
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# mistralai/Mixtral-8x7B-Instruct-v0.
llm_model = "mistralai/Mistral-7B-Instruct-v0.2"

# Basic Prompt
# "Context: {context}\n\nSentence: {sentence}\n\nIs the sentence supported by the context above? Answer Yes or No.\n\nAnswer: "

selfcheck_prompt = SelfCheckLLMPrompt(llm_model, device)

# context, sentences[10], and context_samples are reused from the NLI section above

# Passage:
prompt_scores = selfcheck_prompt.predict(
    sentences = [sentences[10]],                       
    sampled_passages = [context],
    verbose = True
)
print(f"Passage (low score as expected): {prompt_scores}")

prompt_scores = selfcheck_prompt.predict(
    sentences = [sentences[10]],                       
    sampled_passages = context_samples,
    verbose = True
)
print(f"Avg score on sentences (better than NLI): {prompt_scores}")

specific_exp = """
In December 2018, he was named in Sri Lanka's team for the 2018 ACC Emerging Teams Asia Cup. He was the leading wicket-taker for Sri Lanka in the tournament, with nine dismiss
"""
prompt_scores = selfcheck_prompt.predict(
    sentences = [sentences[10]],                       
    sampled_passages = [specific_exp],
    verbose = True
)

print(f"Does it even work - comparing with the matched string (low score as expected)?: {prompt_scores}")

Thoughts?

potsawee commented 6 months ago

Hi @pramitchoudhary

Sorry for the late reply & thanks for looking into the annotations. Yes, it is possible that some sentences were annotated incorrectly. We tried our best to ensure that the annotations (made against the actual Wikipedia articles as ground truth) are as accurate as possible.

Feel free to revise the annotations if you spot errors.