Closed: rfriel closed this issue 1 year ago.
Hi @rfriel ,
Thank you for your comments!
We have added a Jupyter notebook containing the probability-based baseline experiment. Please check it here: https://github.com/potsawee/selfcheckgpt/blob/main/demo/experiments/probability-based-baselines.ipynb. The notebook loads the OpenAI GPT-3 (text-davinci-003) outputs/responses, and performs the analysis (with our code to plot PR curves and obtain AUC). I hope you find this useful.
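For reference, the PR-curve/AUC computation is roughly as in the following NumPy sketch. This is a simplified stand-in, not the notebook's actual code, and the toy `labels`/`scores` values are made up for illustration:

```python
# A minimal sketch of computing a precision-recall curve and its AUC from
# per-sentence hallucination scores and binary human labels.
import numpy as np

def pr_curve(labels, scores):
    # Sweep thresholds in descending score order; at each cut-off,
    # every sentence above the cut is predicted "non-factual".
    order = np.argsort(-scores)
    labels = labels[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    recall = tp / labels.sum()
    return precision, recall

def pr_auc(precision, recall):
    # Trapezoidal area under the PR curve, anchored at recall 0, precision 1.
    r = np.concatenate(([0.0], recall))
    p = np.concatenate(([1.0], precision))
    return np.trapz(p, r)

# Toy data: a scorer that ranks the three hallucinated sentences on top.
labels = np.array([1, 0, 1, 1, 0, 0])
scores = np.array([0.9, 0.2, 0.8, 0.4, 0.3, 0.1])
precision, recall = pr_curve(labels, scores)
area = pr_auc(precision, recall)
```

A flat curve with precision near the base rate at all recall levels would give an area close to the positive-class fraction, which is the "random guessing" behaviour described below.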
Sorry for the confusion about Non-Factual*. The asterisk denotes "less-trivial hallucination detection" as in our setup/definition: the percentage of hallucination is quite high (in some examples, all sentences are labelled as major-hallucination), so we would like to see how these methods perform on a more challenging hallucination detection task. You can see in the notebook above how we created the "Non-Factual*" labels (human_label_detect_False_h
).
Potsawee
Thank you for the interesting paper, and for providing your models and data!
Do you have the code you used to produce the evaluation tables and plots in the paper, and would you be open to including it in this repo? That would help me a lot with an experiment I'm running.
Context:
I have been trying to reproduce the results from the paper, specifically the probability-based baselines for sentence-level factuality.
I'm using `text-davinci-003` for logprobs, splitting the tokens into the provided sentences, and computing average and minimum logprobs over each sentence. For each of {average logprob, minimum logprob}, I compute it in two ways.

I'm seeing very little relationship between any of these metrics and the annotations. My precision-recall curves look flat, with precision roughly equal to the base rate as recall ranges from 0 to 1 -- i.e., like random guessing.
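In case it helps pin down where I might be going wrong, here is a minimal sketch of the two scores I'm computing. It assumes the token logprobs have already been split per sentence (the token-to-sentence alignment step is omitted), and the toy numbers are made up:

```python
# Sentence-level baseline scores from per-sentence token logprobs.
# Higher score = more likely non-factual.
import numpy as np

def sentence_logprob_scores(per_sentence_logprobs):
    # Negate so that low-probability sentences get high hallucination scores.
    avg_scores = [-np.mean(lps) for lps in per_sentence_logprobs]
    min_scores = [-np.min(lps) for lps in per_sentence_logprobs]
    return avg_scores, min_scores

# Toy example: the second sentence contains one very unlikely token.
per_sent = [[-0.1, -0.2, -0.15], [-0.3, -5.0, -0.4]]
avg_s, min_s = sentence_logprob_scores(per_sent)
```

With these toy numbers, both scores rank the second sentence as more likely hallucinated, which is the behaviour I'd expect but am not seeing on the real annotations.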
It's possible (likely?) that I'm doing something wrong. I'll keep squinting at my code, but in the meantime, I thought I'd ask you for the code you used.
Another question: what is meant by **Non-Factual\*** (with the asterisk) in the paper? It appears in all the evals, but I can't find any explanatory text about it.