potsawee / selfcheckgpt

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
MIT License

Questions about the SelfCheckGPT-NLI and SelfCheckGPT-Prompt #24

Closed hbr690188270 closed 4 months ago

hbr690188270 commented 7 months ago

Hi @potsawee, thank you for the impressive method and the easy-to-use dataset. I have two quick questions about the SelfCheckGPT-NLI and SelfCheckGPT-Prompt methods, which achieve particularly strong performance.

Thank you so much for your impressive work and effort!

potsawee commented 7 months ago

Hi @hbr690188270, thanks for trying the selfcheckgpt package! Regarding your questions:

  1. I agree that we will see some performance difference when using different NLI models. I haven't tried another model yet, but I wouldn't be surprised by the difference you observed, since you ran it on only 20 examples, which can introduce noise.

On this point, I'd look at scatter plots of passage-level scores (i.e., the average over sentences for each document) to see whether different NLI models show a similar trend; a rough sketch of this comparison is at the end of this reply. I'll try it when I have more time.

  2. Yes, I actually plan to implement and test LLM-Prompt with the OpenAI API this weekend.

I'll reply more this weekend when I have time to work on this!
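
In case it helps, here is the passage-level comparison I have in mind as a rough sketch. It assumes sentence-level scores from two NLI models have already been computed for the same documents; the dictionaries and model names below are placeholder toy values, not real results.

```python
# Sketch: compare passage-level scores from two NLI models.
# Assumes sentence-level hallucination scores per document are already computed.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import spearmanr

# {doc_id: [score per sentence]} for each NLI model (toy placeholder values)
scores_deberta = {"doc1": [0.9, 0.1, 0.4], "doc2": [0.2, 0.3], "doc3": [0.8, 0.7, 0.6]}
scores_bart    = {"doc1": [0.8, 0.2, 0.5], "doc2": [0.1, 0.4], "doc3": [0.9, 0.6, 0.7]}

# passage-level score = average of sentence-level scores within each document
docs = sorted(scores_deberta.keys())
passage_deberta = np.array([np.mean(scores_deberta[d]) for d in docs])
passage_bart    = np.array([np.mean(scores_bart[d]) for d in docs])

rho, _ = spearmanr(passage_deberta, passage_bart)
print(f"Spearman correlation between passage-level scores: {rho:.3f}")

plt.scatter(passage_deberta, passage_bart)
plt.xlabel("passage-level score (deberta-v3-large-mnli)")
plt.ylabel("passage-level score (bart-large-mnli)")
plt.title("Do the two NLI models rank passages similarly?")
plt.show()
```

If the points fall roughly on a line (high correlation), the choice of NLI model probably matters less at the passage level than the sentence-level differences suggest.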

Kirushikesh commented 7 months ago

@potsawee I also had the same question. Previously, in my use case, I used facebook/bart-large-mnli for NLI. Did you run any study to conclude that potsawee/deberta-v3-large-mnli is best for this task? Sorry, I'm new to the field; do we have any comparisons between these models?

potsawee commented 5 months ago

Sorry again for my late response.

  1. SelfCheckGPT with LLM-Prompting using OpenAI's API was implemented and merged into the selfcheckgpt package last month; a rough sketch of the prompting idea is included after this list.

  2. Regarding NLI models, I haven't had the chance to compare different NLI models yet, but I don't think there is anything special about potsawee/deberta-v3-large-mnli. I chose my own model because deberta-v3 was a SOTA encoder-only model at the time of these experiments and I couldn't find a variant fine-tuned on MultiNLI, so I trained one myself (with a standard cross-entropy loss). I would expect all deberta-v3 MNLI models to perform similarly on this task and on the original MNLI benchmark. If you have results for another NLI model on the MNLI test set, I'm happy to evaluate potsawee/deberta-v3-large-mnli again. A sketch of how such a comparison could be run with any MNLI-style checkpoint is also included below.
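
On point 1, here is a minimal sketch of the prompting idea. This is not the exact code merged into the package; the prompt wording and Yes/No mapping follow the general approach in the paper, and the model name is just an example.

```python
# Sketch of LLM-Prompt: ask an LLM whether each sentence is supported by each
# sampled passage, map Yes -> 0.0, No -> 1.0 (0.5 if unclear), and average
# over the sampled passages to get a sentence-level score.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = (
    "Context: {context}\n\nSentence: {sentence}\n\n"
    "Is the sentence supported by the context above? Answer Yes or No."
)

def sentence_score(sentence: str, sampled_passages: list[str],
                   model: str = "gpt-3.5-turbo") -> float:
    scores = []
    for passage in sampled_passages:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": PROMPT.format(context=passage, sentence=sentence)}],
            temperature=0.0,
        )
        answer = response.choices[0].message.content.strip().lower()
        if answer.startswith("yes"):
            scores.append(0.0)   # supported -> low hallucination score
        elif answer.startswith("no"):
            scores.append(1.0)   # not supported -> high hallucination score
        else:
            scores.append(0.5)   # undecided
    return sum(scores) / len(scores)
```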
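
And on point 2, for anyone who wants to try the NLI-model comparison themselves, here is an illustrative sketch of how the contradiction-based score could be computed with any MNLI-style checkpoint from the Hub. This is not the package's internal code, and the example model name is just a placeholder; the contradiction label index is read from the model config because label ordering differs between checkpoints.

```python
# Sketch: sentence-level score = average P(contradiction) over sampled passages,
# using any MNLI-style checkpoint from the Hugging Face Hub.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def nli_sentence_score(sentence: str, sampled_passages: list[str],
                       model_name: str = "facebook/bart-large-mnli") -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    # look up the "contradiction" class index for this checkpoint
    label2id = {k.lower(): v for k, v in model.config.label2id.items()}
    contradiction_id = label2id["contradiction"]

    probs = []
    with torch.no_grad():
        for passage in sampled_passages:
            # premise = sampled passage, hypothesis = sentence being checked
            inputs = tokenizer(passage, sentence, return_tensors="pt", truncation=True)
            logits = model(**inputs).logits
            probs.append(torch.softmax(logits, dim=-1)[0, contradiction_id].item())

    return sum(probs) / len(probs)
```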

Best wishes,
Potsawee