Closed hbr690188270 closed 4 months ago
Hi @hbr690188270, thanks for trying the selfcheckgpt package! Regarding your questions:
On this issue, I'd look at scatter plots of passage-level scores (i.e., average over sentences for each document) to see if there is a similar trend for different NLI models. I'll try this when I have more time.
I'll reply more this weekend when I have time to work on this!
@potsawee I also had the same question. Previously, in my use case I used facebook/bart-large-mnli for NLI. Did you do any study to conclude that potsawee/deberta-v3-large-mnli is best for this task? Sorry, I'm new to the field; do we have any kind of comparison between these models?
Sorry again for my late response.
SelfCheckGPT with LLM-Prompting using OpenAI's API was implemented and merged into this selfcheckgpt package last month.
Regarding NLI models, I haven't had the chance to compare different NLI models yet, but I don't think there is anything special about potsawee/deberta-v3-large-mnli. I chose to use my own model because deberta-v3 was a SOTA encoder-only model at the time of conducting this experiment and I couldn't find a variant fine-tuned on Multi-NLI, so I trained one myself (using standard cross-entropy loss). I would expect all deberta-v3-mnli models to have similar performance on this task and on the original MNLI task. If you have results for other NLI models on the MNLI test set, I'm happy to evaluate potsawee/deberta-v3-large-mnli again.
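For reference, the core computation in SelfCheckGPT-NLI is simple: for each sentence, take P(contradiction) from the NLI model against each sampled passage (renormalised over entailment/contradiction only) and average over samples. A minimal sketch, with made-up logits standing in for a real model's outputs (the package itself runs potsawee/deberta-v3-large-mnli via transformers):

```python
import math

def two_class_contradiction_prob(entail_logit, contra_logit):
    """P(contradiction) from a softmax over {entailment, contradiction} only."""
    e = math.exp(entail_logit)
    c = math.exp(contra_logit)
    return c / (e + c)

def selfcheck_nli_sentence_score(logits_per_sample):
    """SelfCheckGPT-NLI score for one sentence: average P(contradiction)
    across all sampled passages. Higher means more likely hallucinated."""
    probs = [two_class_contradiction_prob(e, c) for e, c in logits_per_sample]
    return sum(probs) / len(probs)

# Hypothetical (entailment, contradiction) logits for one sentence
# checked against N = 4 sampled passages -- illustrative numbers only.
samples = [(2.1, -1.3), (1.8, -0.9), (-0.5, 1.2), (2.4, -2.0)]
score = selfcheck_nli_sentence_score(samples)
print(round(score, 3))
```

Passage-level scores (for the scatter plots mentioned above) are then just the average of these sentence scores over a document.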
Best wishes,
Potsawee
Hi @potsawee, thank you for the impressive method and easy-to-use dataset. Two quick questions about the SelfCheckGPT-NLI and SelfCheckGPT-Prompt methods, which achieve quite strong performance.
Regarding the SelfCheckGPT-NLI method, I noticed that you use the "potsawee/deberta-v3-large-mnli" NLI model. Since I had been using other NLI models in my previous practice, I replaced it with other NLI models including "microsoft/deberta-large-mnli" and "MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli". I also modified the contradiction probabilities following your suggestion by removing the probability of the "neutral" class. However, the performance of these two NLI models seems a little weak. I tested the AUC-PR on the "NonFact" setting with the first 20 passages in the dataset (around 400 sentences), and here is the performance:
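One common gotcha when swapping MNLI checkpoints, worth ruling out here: different checkpoints order their output labels differently, so reading P(contradiction) by a fixed index can silently pick up the wrong class. A small sketch of the neutral-removal step that looks the class up by name via the model's `id2label` mapping instead (the two label orders below are hypothetical; check each checkpoint's config):

```python
import math

def contradiction_prob(logits, id2label):
    """P(contradiction), renormalised over {entailment, contradiction} only
    (the "neutral" class dropped), using the model's id2label mapping so
    that checkpoints with different label orders are handled consistently."""
    by_name = {id2label[i].lower(): z for i, z in enumerate(logits)}
    e = math.exp(by_name["entailment"])
    c = math.exp(by_name["contradiction"])
    return c / (e + c)

# Illustrative logits; real values come from the NLI model's forward pass.
logits = [0.3, 1.1, 2.0]

# Two hypothetical label orders -- verify against each model's config.json.
order_a = {0: "entailment", 1: "neutral", 2: "contradiction"}
order_b = {0: "CONTRADICTION", 1: "NEUTRAL", 2: "ENTAILMENT"}

p_a = contradiction_prob(logits, order_a)  # reads logit 2.0 as contradiction
p_b = contradiction_prob(logits, order_b)  # reads logit 0.3 as contradiction
print(round(p_a, 3), round(p_b, 3))
```

With an index mix-up, the same logits yield very different contradiction probabilities, which alone could account for a sizeable AUC-PR drop.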
Regarding the SelfCheckGPT-Prompt method, I see you updated the code a few hours ago. It seems the current code only supports open-source models rather than the gpt-3.5-turbo/Text-Davinci-003 used in the main paper (Table 2). Is there an estimated timeline for releasing the code for SelfCheckGPT-Prompt with gpt-3.5-turbo/Text-Davinci-003?
Thank you so much for your impressive work and effort!