Open Atharva-Phatak opened 2 years ago
On the SummEval dataset, for FLU, COH and INFO, we also used BARTScore(s->h).
So what was the reason for using a single score (s->h)? Does BARTScore holistically measure the quality of generated text?
For example, can you report the s->h variant of BARTScore and conclude, on the basis of that score alone, that the summary generated by Model A is better overall than the one generated by Model B?
Also, how do you decide which BARTScore variant to use for a particular dataset to measure COH, FLU, INFO and FAC?
Please let me know.
Here are some rules we have followed when deciding which BARTScore variant to use.
However, we agree that designing a metric with multiple interpretable dimensions would be promising future work.
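To make the direction notation in this thread concrete, here is a minimal sketch of how a directed score such as s->h, r->h or h->r can be read: the log-likelihood of the target text given the conditioning text. It is an illustration only, not the repository's implementation. `toy_log_prob` and `directed_score` are hypothetical names, and the toy model is a smoothed word-overlap scorer standing in for BART's token log-probabilities. It also ignores the autoregressive conditioning on previous target tokens and simply averages with uniform weights.

```python
import math
from typing import Callable, Sequence


def toy_log_prob(token: str, condition: Sequence[str]) -> float:
    """Hypothetical stand-in for a seq2seq model's log p(token | condition).

    Uses add-one smoothed word counts over the conditioning text.
    A real BARTScore run would use BART's token log-probabilities instead.
    """
    vocab = set(condition) | {token}
    count = sum(1 for c in condition if c == token)
    return math.log((count + 1) / (len(condition) + len(vocab)))


def directed_score(
    condition: str,
    target: str,
    log_prob: Callable[[str, Sequence[str]], float] = toy_log_prob,
) -> float:
    """Average log-likelihood of the target tokens given the conditioning text."""
    cond_tokens = condition.lower().split()
    tgt_tokens = target.lower().split()
    return sum(log_prob(t, cond_tokens) for t in tgt_tokens) / len(tgt_tokens)


source = "the cat sat on the mat near the door"
hypothesis = "the cat sat on the mat"
reference = "a cat was sitting on the mat"

# The arrow reads "condition -> target".
scores = {
    "s->h": directed_score(source, hypothesis),     # source -> hypothesis
    "r->h": directed_score(reference, hypothesis),  # reference -> hypothesis
    "h->r": directed_score(hypothesis, reference),  # hypothesis -> reference
}

for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

All scores are log-likelihoods, so they are negative, and a value closer to zero means the target is more likely under the conditioning text. Only the choice of which text is the condition and which is the target changes between variants.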
In Table 4 of the paper, for the SummEval dataset, you measured COH, FAC, FLU and INFO. I wanted to know which variants of BARTScore you used.
From my understanding of the paper, for factuality (FAC) you must have used BARTScore(s->h), i.e. source -> hypothesis.
But I am not clear about FLU, COH and INFO.
If you could please elaborate, that would be really helpful.