neulab / BARTScore

BARTScore: Evaluating Generated Text as Text Generation
Apache License 2.0

Spearman Correlations for Table-4 #22

Open Atharva-Phatak opened 2 years ago

Atharva-Phatak commented 2 years ago

In Table-4 of the paper, for the SummEval dataset you measured COH, FAC, FLU, and INFO. I wanted to know which variants of BARTScore you used.

From my understanding of the paper, for factuality (FAC) you must have used BARTScore(s->h), i.e. source -> hypothesis.

But I am not clear about FLU, COH, and INFO.

If you could please elaborate, that would be really helpful.


yyy-Apple commented 2 years ago

On the SummEval dataset, for FLU, COH, and INFO, we also used BARTScore(s->h).
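For context on what the s->h direction computes: BARTScore scores a hypothesis as the average log-likelihood that a seq2seq model (BART) assigns to the hypothesis tokens when conditioned on the source. The helper below is a toy sketch of that averaging step only; the per-token probabilities are made-up stand-ins, not actual BART outputs.

```python
import math

def avg_log_likelihood(token_probs):
    """Mean log-probability over hypothesis tokens.

    BARTScore(s->h) averages log p(h_t | h_<t, s) over the tokens of
    the hypothesis h given the source s. Here `token_probs` stands in
    for those per-token conditional probabilities.
    """
    return sum(math.log(p) for p in token_probs) / len(token_probs)

# Toy illustration (probabilities are invented for the example):
score = avg_log_likelihood([0.5, 0.25, 0.5])  # negative; closer to 0 is better
```

A higher (less negative) score means the model finds the hypothesis more likely given the source, which is the quantity the paper correlates with human judgments.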

Atharva-Phatak commented 2 years ago

So what was the reason for using a single score (s->h)? Does BARTScore holistically measure the quality of generated text?

For example, can you report the s->h variant of BARTScore and say that, on the basis of that score, the quality of the summary generated by Model A is better than that of Model B?

Also, how do you decide which BARTScore variant to use for a particular dataset to measure COH, FLU, INFO, and FAC?

Please let me know.

yyy-Apple commented 2 years ago

Here are some rules we have followed when deciding which BARTScore variant to use.

However, we agree that designing a metric with multiple interpretable dimensions would be a promising direction for future work.