potsawee / selfcheckgpt

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

Could you provide an example of using ngram to predict the factuality of a sentence? #16

tianyang-x opened this issue 1 year ago

tianyang-x commented 1 year ago

Hi @potsawee , thanks for sharing your awesome work.

However, when trying to run your code, I found that even though there is an n-gram model, no examples of its usage are provided. The n-gram model is quite different from the others, since its score ranges over $[0, \infty)$ while the others range over $[0, 1]$. I tried normalizing it with min-max scaling, $\frac{x - x_{\min}}{x_{\max} - x_{\min}}$, but this is probably not precise enough: the $x$ values within a single output can all be similarly high or similarly low, so stretching them across $[0, 1]$ is not a reliable normalization.
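Here is roughly what I tried, as a minimal sketch (`ngram_scores` is a hypothetical list of the per-sentence scores returned by the n-gram variant):

```python
def minmax_normalize(ngram_scores):
    # Min-max scaling of the per-sentence n-gram scores within one output.
    lo, hi = min(ngram_scores), max(ngram_scores)
    if hi == lo:
        # Every sentence scored the same, so min-max scaling is undefined.
        return [0.0 for _ in ngram_scores]
    return [(x - lo) / (hi - lo) for x in ngram_scores]

# When all scores are similarly high, tiny differences get stretched
# across the full [0, 1] range:
print(minmax_normalize([3.20, 3.30, 3.25]))  # ~ [0.0, 1.0, 0.5]
```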

I wonder if I have misunderstood something. Could you please suggest a better normalization method, point out any problems in my reasoning, and share an example of evaluating the hallucination score with the n-gram method? Thanks.

potsawee commented 1 year ago

Hi @Amano-Aki, Thanks for trying out the work!

The n-gram variant is, at the moment, meant for ranking candidates, e.g., (1) sentence-level detection, i.e., ranking sentences within the same document; or (2) passage-level ranking.
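For sentence-level ranking, here is a minimal sketch following the `SelfCheckNgram` usage shown in the README (all text inputs here are placeholders):

```python
from selfcheckgpt.modeling_selfcheck import SelfCheckNgram

# Placeholder inputs: the sentences of the passage being checked, the
# passage itself, and a few stochastically sampled re-generations
# produced by the same LLM.
sentences = [
    "Sentence one of the model's response.",
    "Sentence two of the model's response.",
]
passage = " ".join(sentences)
sampled_passages = [
    "First sampled passage from the same prompt.",
    "Second sampled passage from the same prompt.",
    "Third sampled passage from the same prompt.",
]

selfcheck_ngram = SelfCheckNgram(n=1)  # n=1 -> unigram model
scores = selfcheck_ngram.predict(
    sentences=sentences,
    passage=passage,
    sampled_passages=sampled_passages,
)

# scores["sent_level"]["avg_neg_logprob"] holds one unnormalized score in
# [0, inf) per sentence; higher = less supported by the samples.
# Ranking within the document, most-likely-hallucinated first:
order = sorted(
    range(len(sentences)),
    key=lambda i: scores["sent_level"]["avg_neg_logprob"][i],
    reverse=True,
)
```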

I only applied normalization ( $[0, \infty) \rightarrow [0, 1]$ ) when I tried combining the scores of different variants. However, for the AUC-PR and correlation results in the paper, I did not normalize the n-gram score.
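For combining, any order-preserving map from $[0, \infty)$ into $[0, 1]$ works in principle; here is one illustrative choice (just a sketch, not a prescribed recipe):

```python
import math

def squash(x: float) -> float:
    # Bounded, order-preserving transform from [0, inf) to [0, 1):
    # larger n-gram scores map to larger outputs.
    return 1.0 - math.exp(-x)

# e.g., averaging with a variant whose score already lies in [0, 1]:
ngram_score, other_score = 3.2, 0.7
combined = 0.5 * squash(ngram_score) + 0.5 * other_score
```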

Apologies for the mistake about the n-gram method in the current version of the paper on arXiv; an updated paper will be made available soon. Also, you could try out SelfCheck with NLI and SelfCheck with LLM-prompting (e.g., Llama 2) as well.
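For example, the NLI variant follows a similar interface, and its per-sentence scores already lie in $[0, 1]$, so no extra normalization is needed. A sketch with placeholder inputs:

```python
import torch
from selfcheckgpt.modeling_selfcheck import SelfCheckNLI

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
selfcheck_nli = SelfCheckNLI(device=device)

# Placeholders, as in the n-gram sketch above.
sentences = ["Sentence one of the model's response.", "Sentence two."]
sampled_passages = [
    "First sampled passage from the same prompt.",
    "Second sampled passage from the same prompt.",
]

sent_scores_nli = selfcheck_nli.predict(
    sentences=sentences,
    sampled_passages=sampled_passages,
)
# One score per sentence in [0, 1]; higher = more likely hallucinated.
```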