potsawee / selfcheckgpt

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
MIT License

range of selfcheck_bertscore #10

Closed EngSalem closed 1 year ago

EngSalem commented 1 year ago

Hello, I am trying to use your work to estimate the factuality of samples. I am getting relatively low selfcheck_bertscore scores even when the samples totally contradict each other. I was wondering how you decided whether a passage is factual or non-factual.

Thank you

potsawee commented 1 year ago

Hi @EngSalem,

Sorry for my late reply; I only saw this just now.

You could run SelfCheck-BERTScore on your development dataset and choose the threshold that works best there.
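For illustration, here is a minimal sketch of what choosing a threshold on a development set could look like. It assumes you already have one SelfCheck score per sentence plus a gold label (1 = non-factual, 0 = factual). The helper names `f1_at_threshold` and `pick_threshold`, the choice of F1 as the selection metric, and the example numbers are all made up for this sketch; they are not part of selfcheckgpt.

```python
from typing import Sequence, Tuple


def f1_at_threshold(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1 for the non-factual class when predicting 1 for score >= threshold."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        pred = 1 if score >= threshold else 0
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 1:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pick_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Try every observed score as a threshold and keep the one with the best F1."""
    best_threshold, best_f1 = 0.0, -1.0
    for candidate in sorted(set(scores)):
        f1 = f1_at_threshold(scores, labels, candidate)
        if f1 > best_f1:  # strict '>' keeps the lowest threshold among ties
            best_threshold, best_f1 = candidate, f1
    return best_threshold, best_f1


# Hypothetical development-set scores and labels (not real results).
dev_scores = [0.05, 0.08, 0.12, 0.20, 0.25, 0.31]
dev_labels = [0, 0, 0, 1, 1, 1]

threshold, f1 = pick_threshold(dev_scores, dev_labels)
print(f"threshold={threshold:.2f}, dev F1={f1:.2f}")
```

The absolute scores can all be small and the threshold still separates the two classes, so a low range by itself is not necessarily a problem. If your data is imbalanced, a different selection metric may suit you better than F1.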

Note that someone pointed out to me that the current BERTScore is not rescaled, so the raw values are extremely high (or, in our case, low). See the explanation here: https://github.com/Tiiiger/bert_score/blob/master/journal/rescale_baseline.md
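To make the rescaling concrete, here is a small sketch of the linear baseline rescaling described in that document, `(score - baseline) / (1 - baseline)`. The baseline value `0.85` and the raw scores are made-up numbers for illustration; the real baselines ship with bert_score and depend on the model, layer, and language. The helper name `rescale` is hypothetical too.

```python
def rescale(score: float, baseline: float) -> float:
    """Linearly rescale a raw BERTScore so the baseline maps to 0 and 1 stays 1."""
    return (score - baseline) / (1.0 - baseline)


# Hypothetical values: raw BERTScores tend to bunch up in a narrow band near 1.
baseline = 0.85
raw_scores = [0.86, 0.90, 0.95]

rescaled_scores = [rescale(s, baseline) for s in raw_scores]

for raw, rescaled in zip(raw_scores, rescaled_scores):
    print(f"raw={raw:.2f} -> rescaled={rescaled:.3f}")
```

Rescaling does not change the ranking of the scores. It only spreads them over a wider range, which makes them easier to read and to threshold.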

I will edit the code soon to add rescale_with_baseline as an option for BERTScore, but you are welcome to add it yourself in the meantime.

Best, Potsawee

potsawee commented 1 year ago

The rescale_with_baseline option has now been added to SelfCheck-BERTScore.