potsawee / selfcheckgpt

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

Possible annotation errors? #26

Open pramitchoudhary opened 8 months ago

pramitchoudhary commented 8 months ago

Hi there, awesome stuff, thanks for sharing.

Question: Are there possible annotation errors in the eval dataset (wiki_bio_gpt3_hallucination)?

Example: observing example 6 (0-indexed idx = 5), sentence index 11:

array(['Akila Dananjaya (born 2 August 1995) is a Sri Lankan cricketer.',
       'He made his international debut for the Sri Lankan cricket team in August 2018.',
       'He is a right-arm off-spinner and right-handed batsman.',
       'Dananjaya made his first-class debut for Sri Lanka Army Sports Club in the 2013–14 Premier League Tournament.',
       'He was the leading wicket-taker in the tournament, taking 32 wickets in seven matches.',
       'He made his List A debut for Sri Lanka Army Sports Club in the 2014–15 Premier Limited Overs Tournament.',
       'In August 2018, he was named in the Sri Lankan squad for the 2018 Asia Cup.',
       'He made his One Day International (ODI) debut for Sri Lanka against Bangladesh on 15 September 2018.',
       "In October 2018, he was named in Sri Lanka's Test squad for their series against England, but he did not play.",
       "In December 2018, he was named in Sri Lanka's team for the 2018 ACC Emerging Teams Asia Cup.",
       'He was the leading wicket-taker for Sri Lanka in the tournament, with nine dismiss'],
      dtype=object)

index = 11

He was the leading wicket-taker for Sri Lanka in the tournament, with nine dismiss

Ground truth: major_inaccurate. Should it be accurate?

Observation: in a brief analysis, only the NLI scores aggregated at the sentence level seem to agree with the ground truth. The code is provided below.

import spacy
import torch
from datasets import load_dataset
from selfcheckgpt.modeling_selfcheck import SelfCheckNLI

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_nli = SelfCheckNLI(device=device)  # used below; was undefined in the original snippet

dataset = load_dataset("potsawee/wiki_bio_gpt3_hallucination")
eval_df = dataset['evaluation'].to_pandas()
eval_df.head()
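# columns in this dataset (as I read the dataset card): gpt3_text, wiki_bio_text,
# gpt3_sentences, annotation -- one row per GPT-3-generated passage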

# selecting example 6 (idx = 5)
_idx = 5
sentences = eval_df['gpt3_sentences'][_idx]
context = eval_df['gpt3_text'][_idx]

gt_labels = eval_df['annotation'][_idx]
print(f"labels:\n{gt_labels}")

print(f"Label for sentence index 11: {gt_labels[10]}")

print("Context:\n")
print(context)
print("\n\n")

print(f"**Eval sentence:**\n {sentences[10]}")
print("\n\n")

# Context broken into sentences
nlp = spacy.load("en_core_web_md")
context_samples = [sent.text.strip() for sent in nlp(context).sents]

# Passage:
nli_scores = selfcheck_nli.predict(
    sentences = [sentences[10]],                       
    sampled_passages = [context]
)
print(f"Passage (low score as expected): {nli_scores}")

nli_scores = selfcheck_nli.predict(
    sentences = [sentences[10]],                       
    sampled_passages = context_samples
)
print(f"Avg score on sentences is high: {nli_scores}")

specific_exp = """
In December 2018, he was named in Sri Lanka's team for the 2018 ACC Emerging Teams Asia Cup. He was the leading wicket-taker for Sri Lanka in the tournament, with nine dismiss
"""
nli_scores_specific = selfcheck_nli.predict(
    sentences = [sentences[10]],                       
    sampled_passages = [specific_exp]
)

print(f"Does it even work - comparing with the matched string (low score as expected)? : {nli_scores_specific}")

print("---- Using LLM - mistralai/Mistral-7B-Instruct-v0.2 ----") # quick eval purpose
from selfcheckgpt.modeling_selfcheck import SelfCheckLLMPrompt
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# mistralai/Mixtral-8x7B-Instruct-v0.
llm_model = "mistralai/Mistral-7B-Instruct-v0.2"

# Basic Prompt
# "Context: {context}\n\nSentence: {sentence}\n\nIs the sentence supported by the context above? Answer Yes or No.\n\nAnswer: "

selfcheck_prompt = SelfCheckLLMPrompt(llm_model, device)

# context, sentences[10], and context_samples are reused from the NLI section above

# Passage:
prompt_scores = selfcheck_prompt.predict(
    sentences = [sentences[10]],                       
    sampled_passages = [context],
    verbose = True
)
print(f"Passage (low score as expected): {prompt_scores}")

prompt_scores = selfcheck_prompt.predict(
    sentences = [sentences[10]],                       
    sampled_passages = context_samples,
    verbose = True
)
print(f"Avg score on sentences (better than NLI): {prompt_scores}")

specific_exp = """
In December 2018, he was named in Sri Lanka's team for the 2018 ACC Emerging Teams Asia Cup. He was the leading wicket-taker for Sri Lanka in the tournament, with nine dismiss
"""
prompt_scores = selfcheck_prompt.predict(
    sentences = [sentences[10]],                       
    sampled_passages = [specific_exp],
    verbose = True
)

print(f"Does it even work - comparing with the matched string (low score as expected)?: {prompt_scores}")

Thoughts?

potsawee commented 6 months ago

Hi @pramitchoudhary

Sorry for the late reply & thanks for looking into the annotations. Yes, it is possible that some sentences were annotated incorrectly. We tried our best to ensure that the annotations (made against the actual Wikipedia articles as ground truth) are as accurate as possible.

Feel free to revise the annotations if you spot errors.