potsawee / selfcheckgpt

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

Code to reproduce the paper's evaluations #1

Closed · rfriel closed this issue 1 year ago

rfriel commented 1 year ago

Thank you for the interesting paper, and for providing your models and data!

Do you have the code you used to produce the evaluation tables and plots in the paper, and would you be open to including it in this repo? That would help me a lot with an experiment I'm running.


Context:

I have been trying to reproduce the results from the paper, specifically the probability-based baselines for sentence-level factuality.

I'm using text-davinci-003 for logprobs, splitting the tokens into the provided sentences, and computing average and minimum logprobs over sentences. For each of {average logprob, minimum logprob}, I compute it in two ways:

  1. Just using the logprob sent back from the API
  2. Computing "probabilities" for the top 5 tokens by renormalising the top-5 logprobs, then taking the log of the sampled token's renormalised probability -- similar to how PPL5 is computed in the paper. A rough sketch of my computation is below.
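
For concreteness, this is roughly what I'm doing (a simplified sketch: variable names are mine, and the token-to-sentence alignment here is cruder than in my real code):

```python
import math

def sentence_logprob_scores(tokens, token_logprobs, top_logprobs, sentences):
    """Per-sentence avg/min logprob, computed two ways.

    tokens         : list[str]   -- tokens as returned by the API
    token_logprobs : list[float] -- API logprob of each sampled token
    top_logprobs   : list[dict]  -- top-5 {token: logprob} at each position
    sentences      : list[str]   -- the provided sentence splits
    """
    scores, i = [], 0
    for sent in sentences:
        lp, lp5, covered = [], [], 0
        # crude alignment: consume tokens until the sentence text is covered
        while i < len(tokens) and covered < len(sent):
            covered += len(tokens[i])
            lp.append(token_logprobs[i])
            # way 2: renormalise over the top-5 alternatives, then re-log
            top = top_logprobs[i]
            z = sum(math.exp(v) for v in top.values())
            # fall back to the API logprob if the sampled token isn't in top-5
            p_tok = math.exp(top.get(tokens[i], token_logprobs[i]))
            lp5.append(math.log(p_tok / z))
            i += 1
        if not lp:  # ran out of tokens (alignment drift); stop early
            break
        scores.append({
            "avg_logprob":  sum(lp) / len(lp),
            "min_logprob":  min(lp),
            "avg_logprob5": sum(lp5) / len(lp5),
            "min_logprob5": min(lp5),
        })
    return scores
```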

I'm seeing very little relationship between any of these metrics and the annotations. My precision-recall curves look flat, with precision roughly equal to the base rate as recall ranges from 0 to 1 -- i.e., like random guessing.

It's possible (likely?) that I'm doing something wrong. I'll keep squinting at my code, but in the meantime, I thought I'd ask you guys for the code you used.


Another question: what is meant by **Non-Factual\*** (with the \*) in the paper? It appears in all the evals, but I can't find any explanatory text about it.

potsawee commented 1 year ago

Hi @rfriel ,

Thank you for your comments!

  1. We have added a Jupyter notebook containing the probability-based baseline experiment. Please check it here: https://github.com/potsawee/selfcheckgpt/blob/main/demo/experiments/probability-based-baselines.ipynb. The notebook loads the OpenAI GPT-3 (text-davinci-003) outputs/responses and performs the analysis (with our code to plot PR curves and obtain AUC). I hope you find this useful.
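
For reference, the PR analysis in the notebook boils down to a standard precision-recall curve and AUC computation; a minimal sketch with illustrative names (not the exact notebook code):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

def pr_auc(labels, scores):
    """labels: 1 = annotated non-factual, 0 = factual.
    scores: higher = more suspected hallucination (e.g. negative avg logprob)."""
    precision, recall, _ = precision_recall_curve(labels, scores)
    return auc(recall, precision)

# Toy usage. A random-guessing detector traces a flat PR curve whose
# precision sits at the positive base rate, i.e. labels.mean().
labels = np.array([1, 0, 1, 1, 0])
scores = np.array([2.3, 0.4, 1.9, 0.7, 0.5])
print(pr_auc(labels, scores), "vs base rate", labels.mean())
```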

  2. Sorry for the confusion about Non-Factual\*. The \* denotes "less-trivial hallucination detection": in our setup/definition the percentage of hallucination is quite high (in some examples, all sentences are labelled as major-hallucination), so we wanted to see how these methods perform on a more challenging subset. You can see in the notebook above how we created the "Non-Factual\*" labels (human_label_detect_False_h).
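
In pseudo-code, the construction is roughly as follows (a sketch only -- the notebook is the source of truth; in particular, treating every non-"accurate" annotation as non-factual is an assumption here):

```python
def nonfactual_star_labels(passages):
    """Sketch of the Non-Factual* construction: drop passages in which every
    sentence is labelled major-inaccurate (total hallucination), then label
    the remaining sentences as usual.

    passages: list of passages, each a list of per-sentence annotations in
    {"accurate", "minor-inaccurate", "major-inaccurate"} (illustrative names).
    """
    labels = []
    for passage in passages:
        if all(a == "major-inaccurate" for a in passage):
            continue  # trivially detectable: the whole passage is hallucinated
        # assumption: a sentence is non-factual iff not annotated "accurate"
        labels.extend(0 if a == "accurate" else 1 for a in passage)
    return labels
```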

Potsawee