nightdessert opened this issue 3 years ago
I noticed the metric is based on uncased BERT, so I did use lower-cased inputs.
I got the same result... When I used the 'generated data' and 'annotated data', it worked well, but the gold data (CNN/DM) gives strange results.
I used summary sentences as claims (one summary has several sentences; I split it and used each sentence as a separate claim).
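For reference, this is roughly the preprocessing I mean, as a minimal sketch: it assumes FactCC-style JSONL examples with `text`, `claim`, and `label` fields (check the repo's data files for the exact schema and file name) and uses NLTK as one possible sentence splitter.

```python
import json
import nltk

nltk.download("punkt", quiet=True)  # sentence-splitter model (one possible choice)

def summary_to_claims(article, summary, out_path):
    """Split a gold summary into sentences and write one FactCC-style
    JSONL example per sentence. Field names ("text", "claim", "label")
    are assumed from the repo's data files and may need adjusting."""
    with open(out_path, "w") as f:
        for i, sent in enumerate(nltk.sent_tokenize(summary)):
            example = {
                "id": f"gold-{i}",
                "text": article,     # source document
                "claim": sent,       # one summary sentence per claim
                "label": "CORRECT",  # gold references assumed factually correct
            }
            f.write(json.dumps(example) + "\n")

summary_to_claims(
    article="Full CNN/DM article text ...",
    summary="First reference sentence. Second reference sentence.",
    out_path="data-dev.jsonl",  # placeholder; use whatever file name the eval script expects
)
```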
In fact, I also encountered this problem. Like the method mentioned above, I used the gold summaries for evaluation and got the following result:
Eval results: bacc = 0.41546565056595314, f1 = 0.41546565056595314, loss = 3.5899247798612546
On the authors' annotated dataset, the results are:
Eval results: bacc = 0.7611692646110668, f1 = 0.8614393125671321, loss = 0.8623681812816171
Some of my observations below:
In summary, I think FactCC can identify local errors like swapping entities or numbers. However, don't count on it to solve the hard NLI problem. Overall, it's still one of the better metrics. You can also check out the following paper.
Goyal, Tanya, and Greg Durrett. "Evaluating factuality in generation with dependency-level entailment." arXiv preprint arXiv:2010.05478 (2020).
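To make the entity/number-swapping point above concrete, here is a toy perturbation in the spirit of FactCC's synthetic transformations (not its actual data-generation code): swap two same-type named entities in an otherwise fluent claim, which is exactly the kind of local error a claim-level classifier is asked to catch. The spaCy model name and the swap heuristic are illustrative assumptions.

```python
import spacy

# Requires the model to be installed, e.g. `python -m spacy download en_core_web_sm`
nlp = spacy.load("en_core_web_sm")  # any English NER model would do

def swap_first_two_entities(claim):
    """Toy entity-swap perturbation: exchange the first two named entities
    of the same type to create a locally wrong ('INCORRECT') claim.
    This mimics the spirit of FactCC's synthetic transformations, not the
    actual implementation."""
    doc = nlp(claim)
    ents = list(doc.ents)
    for i in range(len(ents)):
        for j in range(i + 1, len(ents)):
            if ents[i].label_ == ents[j].label_ and ents[i].text != ents[j].text:
                a, b = ents[i], ents[j]
                # rebuild the string with the two entity spans exchanged
                return (claim[:a.start_char] + b.text +
                        claim[a.end_char:b.start_char] + a.text +
                        claim[b.end_char:])
    return None  # no swappable entity pair found

print(swap_first_two_entities("Alice met Bob in Paris on Friday."))
# -> "Bob met Alice in Paris on Friday."  (assuming both names are tagged PERSON)
```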
I greatly appreciate the discussion above. Has anyone retrained or fine-tuned the model to get results on CNN/DM or other datasets? Would that help produce a more precise factuality evaluation? If not, is FactCC reliable enough to report as a metric in a paper?
I notice that some papers use FactCC as a metric. If FactCC still has this problem, then those results may not be reliable.
@Ricardokevins, you can take a look at the following two comprehensive surveys on factuality metrics. What's disturbing is that they reach very different conclusions. If you're writing a paper, the best you can do is pick 1-2 metrics from each category (e.g., entailment, QA, optionally IE) and report the results of all of them. You also need to do a small-scale human evaluation on something like 50-100 summaries and check how each metric correlates with it (a rough sketch of that comparison follows the references below).
Gabriel, Saadia, et al. "Go figure! a meta evaluation of factuality in summarization." arXiv preprint arXiv:2010.12834 (2020).
Pagnoni, Artidoro, Vidhisha Balachandran, and Yulia Tsvetkov. "Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics." arXiv preprint arXiv:2104.13346 (2021).
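In case it helps, the comparison I mean is just: score the same 50-100 summaries with each automatic metric, collect human factuality judgments, and report correlations. A minimal sketch with made-up numbers, assuming SciPy and binary human labels (the metric names and scores are placeholders, not real outputs):

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical scores for the same six summaries; replace with real outputs
# from FactCC, a QA-based metric, an entailment metric, etc.
metric_scores = {
    "factcc":    [0.9, 0.2, 0.8, 0.1, 0.7, 0.4],
    "qa_metric": [0.8, 0.3, 0.6, 0.2, 0.9, 0.5],
}
human_factuality = [1, 0, 1, 0, 1, 0]  # small-scale human annotation (binary here)

for name, scores in metric_scores.items():
    r, _ = pearsonr(scores, human_factuality)
    rho, _ = spearmanr(scores, human_factuality)
    print(f"{name}: Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```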
thanks a lot <3
> In fact, I also encountered this problem. Like the method mentioned above, I used the gold summaries for evaluation and got the following result:
> Eval results: bacc = 0.41546565056595314, f1 = 0.41546565056595314, loss = 3.5899247798612546
> On the authors' annotated dataset, the results are:
> Eval results: bacc = 0.7611692646110668, f1 = 0.8614393125671321, loss = 0.8623681812816171
My annotated-dataset result is the same as yours. This result, however, is not consistent with the Table 3 F1 score for FactCC. Does anyone have an intuition for why?
I really appreciate the excellent paper. I tested FactCC on the CNN/DM dataset using gold reference sentences as claims (split into single sentences). I strictly followed the README and used the official pre-trained FactCC checkpoint. I labeled all the claims as 'CORRECT' (because they are gold references). The accuracy output by FactCC is around 42%, which means the model thinks only 42% of the reference sentences are factually correct. Is this reasonable, or did I use the metric incorrectly?
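For clarity on what that 42% means: since every gold label is CORRECT here, accuracy reduces to the fraction of claims the model predicts as CORRECT. A tiny sketch of that reduction (the prediction list below is hypothetical, and the evaluation script may emit label ids rather than strings):

```python
def correct_fraction(pred_labels):
    """When all gold labels are CORRECT, accuracy equals the share of
    predictions that are CORRECT."""
    return sum(p == "CORRECT" for p in pred_labels) / len(pred_labels)

# Hypothetical predictions for five gold-reference claims.
preds = ["CORRECT", "INCORRECT", "CORRECT", "INCORRECT", "INCORRECT"]
print(f"accuracy over all-CORRECT gold labels: {correct_fraction(preds):.2f}")  # 0.40
```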