Closed hbr690188270 closed 4 months ago
Hi @hbr690188270, thanks for trying the selfcheckgpt package! Regarding your questions:
On this issue, I'd look at scatter plots of passage-level scores (i.e., average over sentences for each document) to see if there is a similar trend for different NLI models. I'll try this when I have more time.
I'll reply more this weekend when I have time to work on this!
@potsawee I also had the same question. Previously, in my use case I used facebook/bart-large-mnli for NLI. Did you do any study to conclude that potsawee/deberta-v3-large-mnli is best for this task? Sorry, I'm new to the field; do we have any kind of comparison between these models?
Sorry again for my late response.
SelfCheckGPT with LLM-Prompting using OpenAI's API was implemented and merged into this selfcheckgpt package last month.
Regarding NLI models, I haven't had the chance to compare different NLI models yet, but I don't think there is anything special about potsawee/deberta-v3-large-mnli. I chose to use my own model because deberta-v3 was a SOTA encoder-only model at the time of conducting this experiment and I couldn't find a variant fine-tuned on Multi-NLI, so I trained one myself (using standard cross-entropy loss). I would expect all deberta-v3-mnli models to have similar performance on this task and on the original MNLI task. If you have results for other NLI models on the MNLI test set, I'm happy to evaluate potsawee/deberta-v3-large-mnli again.
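For reference, the core computation in SelfCheckGPT-NLI is simple: for each sentence, take P(contradiction) from the NLI model against each sampled passage (renormalised over entailment/contradiction only) and average over samples. A minimal sketch, with made-up logits standing in for a real model's outputs (the package itself runs potsawee/deberta-v3-large-mnli via transformers):

```python
import math

def two_class_contradiction_prob(entail_logit, contra_logit):
    """P(contradiction) from a softmax over {entailment, contradiction} only."""
    e = math.exp(entail_logit)
    c = math.exp(contra_logit)
    return c / (e + c)

def selfcheck_nli_sentence_score(logits_per_sample):
    """SelfCheckGPT-NLI score for one sentence: average P(contradiction)
    across all sampled passages. Higher means more likely hallucinated."""
    probs = [two_class_contradiction_prob(e, c) for e, c in logits_per_sample]
    return sum(probs) / len(probs)

# Hypothetical (entailment, contradiction) logits for one sentence
# checked against N = 4 sampled passages -- illustrative numbers only.
samples = [(2.1, -1.3), (1.8, -0.9), (-0.5, 1.2), (2.4, -2.0)]
score = selfcheck_nli_sentence_score(samples)
print(round(score, 3))
```

Passage-level scores (for the scatter plots mentioned above) are then just the average of these sentence scores over a document.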
Best wishes,
Potsawee
Hi @potsawee, thank you for the impressive method and easy-to-use dataset. Two quick questions about the SelfCheckGPT-NLI and SelfCheckGPT-Prompt methods, which achieve quite strong performance.
Regarding the SelfCheckGPT-NLI method, I noticed that you use the "potsawee/deberta-v3-large-mnli" NLI model. Since I had been using other NLI models in my previous practice, I replaced it with other NLI models including "microsoft/deberta-large-mnli" and "MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli". I also modified the contradiction probabilities following your suggestion by removing the probability of the "neutral" class. However, the performance of these two NLI models seems a little weak. I tested the AUC-PR on the "NonFact" setting with the first 20 passages in the dataset (around 400 sentences), and here is the performance:
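One common gotcha when swapping MNLI checkpoints, worth ruling out here: different checkpoints order their output labels differently, so reading P(contradiction) by a fixed index can silently pick up the wrong class. A small sketch of the neutral-removal step that looks the class up by name via the model's `id2label` mapping instead (the two label orders below are hypothetical; check each checkpoint's config):

```python
import math

def contradiction_prob(logits, id2label):
    """P(contradiction), renormalised over {entailment, contradiction} only
    (the "neutral" class dropped), using the model's id2label mapping so
    that checkpoints with different label orders are handled consistently."""
    by_name = {id2label[i].lower(): z for i, z in enumerate(logits)}
    e = math.exp(by_name["entailment"])
    c = math.exp(by_name["contradiction"])
    return c / (e + c)

# Illustrative logits; real values come from the NLI model's forward pass.
logits = [0.3, 1.1, 2.0]

# Two hypothetical label orders -- verify against each model's config.json.
order_a = {0: "entailment", 1: "neutral", 2: "contradiction"}
order_b = {0: "CONTRADICTION", 1: "NEUTRAL", 2: "ENTAILMENT"}

p_a = contradiction_prob(logits, order_a)  # reads logit 2.0 as contradiction
p_b = contradiction_prob(logits, order_b)  # reads logit 0.3 as contradiction
print(round(p_a, 3), round(p_b, 3))
```

With an index mix-up, the same logits yield very different contradiction probabilities, which alone could account for a sizeable AUC-PR drop.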
Regarding the SelfCheckGPT-Prompt method, I see you updated the code a few hours ago. It seems the current code only supports open-source models rather than the gpt-3.5-turbo/Text-Davinci-003 used in the main paper (Table 2). Is there an estimated timeline for releasing the code for SelfCheckGPT-Prompt with gpt-3.5-turbo/Text-Davinci-003?
Thank you so much for your impressive work and effort!