Closed · ktangri closed this 1 year ago

**ktangri:** Hi @potsawee, great paper and thanks for the easy-to-use library! One quick question: for the NLI results you show in this repo, what is the probability threshold you are using to determine factual vs. non-factual?

Thanks!
**potsawee:** Hi @ktangri, thanks for checking out the repository!
In the reported experiments, we use AUC-PR for the sentence-level detection task and Correlation for the passage-level detection task. These evaluations don't require a threshold to be set.
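For context, here is a minimal sketch of these threshold-free evaluations, assuming scikit-learn and scipy; the scores and labels are made-up placeholders, not data from the paper:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import average_precision_score

# Sentence-level: AUC-PR (average precision) needs no decision threshold.
sent_scores = np.array([0.91, 0.12, 0.55, 0.78, 0.05])  # e.g. P(contradict) per sentence
sent_labels = np.array([1, 0, 1, 1, 0])                 # 1 = non-factual
auc_pr = average_precision_score(sent_labels, sent_scores)

# Passage-level: correlation against human ratings, also threshold-free.
passage_scores = np.array([0.8, 0.3, 0.6, 0.1])
human_ratings = np.array([0.9, 0.2, 0.7, 0.2])
corr, _ = pearsonr(passage_scores, human_ratings)

print(f"AUC-PR: {auc_pr:.3f}, correlation: {corr:.3f}")
```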
If you want a single operating threshold (e.g., for deployment), you could tune it on a development set by optimising some metric; a minimal sketch follows the table below. If we optimise the F1-score on the annotated dataset from my work, the thresholds are:
| Target | Threshold |
| --- | --- |
| Non-Factual | $P(\text{contradict}) = 0.5397$ |
| Factual | $P(\text{entail}) = 1 - P(\text{contradict}) = 0.2948$ |
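As a sketch of that tuning step (hypothetical data; `best_f1_threshold` is an illustrative helper, not part of the library):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(labels, scores):
    """Return the score threshold that maximises F1 on a development set."""
    precision, recall, thresholds = precision_recall_curve(labels, scores)
    # precision/recall have one more entry than thresholds; drop the final point.
    f1 = 2 * precision[:-1] * recall[:-1] / np.clip(precision[:-1] + recall[:-1], 1e-12, None)
    return thresholds[np.argmax(f1)]

# scores = P(contradict) per sentence, labels = 1 for non-factual
dev_scores = np.array([0.9, 0.2, 0.6, 0.8, 0.1, 0.4])
dev_labels = np.array([1, 0, 1, 1, 0, 0])
print(best_f1_threshold(dev_labels, dev_scores))
```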
Note that in the implementation, only the logits of the "entail" and "contradict" classes are used, so $P(\text{entail}) + P(\text{contradict}) = 1.0$.
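A minimal sketch of that two-way normalisation; the model name is a placeholder and the label indices (0 = entail, 2 = contradict, as in a typical 3-way MNLI head) are assumptions:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "some-nli-cross-encoder"  # hypothetical placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

premise = "The Eiffel Tower is in Paris."
hypothesis = "The Eiffel Tower is located in France."
inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits[0]

# Softmax over only the entail and contradict logits (neutral is ignored),
# so the two probabilities sum to 1 by construction.
p_entail, p_contradict = torch.softmax(logits[[0, 2]], dim=-1).tolist()
print(p_entail + p_contradict)  # 1.0
```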
Hope this answers your question.
**ktangri:** Ahh I see, thank you for the clarification! I arrived at a similar threshold for a different benchmark, which is a nice confirmation.