neulab / BARTScore

BARTScore: Evaluating Generated Text as Text Generation
Apache License 2.0

Spearman Correlations for Table-4 #22

Open Atharva-Phatak opened 2 years ago

Atharva-Phatak commented 2 years ago

In Table-4 of the paper, for the SummEval dataset you measured COH, FAC, FLU, and INFO. I wanted to know which variants of BARTScore you used.

From my understanding of the paper, for factuality (FAC) you must have used BARTScore(s->h), i.e. source -> hypothesis.

But I am not clear about FLU, COH, and INFO.

If you could please elaborate, that would be really helpful.


yyy-Apple commented 2 years ago

On the SummEval dataset, for FLU, COH, and INFO, we also used BARTScore(s->h).
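For context on what the s->h direction computes: BARTScore scores a hypothesis as the average log-likelihood that a seq2seq model (BART) assigns to the hypothesis tokens when conditioned on the source. The helper below is a toy sketch of that averaging step only; the per-token probabilities are made-up stand-ins, not actual BART outputs.

```python
import math

def avg_log_likelihood(token_probs):
    """Mean log-probability over hypothesis tokens.

    BARTScore(s->h) averages log p(h_t | h_<t, s) over the tokens of
    the hypothesis h given the source s. Here `token_probs` stands in
    for those per-token conditional probabilities.
    """
    return sum(math.log(p) for p in token_probs) / len(token_probs)

# Toy illustration (probabilities are invented for the example):
score = avg_log_likelihood([0.5, 0.25, 0.5])  # negative; closer to 0 is better
```

A higher (less negative) score means the model finds the hypothesis more likely given the source, which is the quantity the paper correlates with human judgments.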

Atharva-Phatak commented 2 years ago

So what was the reason for using a single score (s->h)? Does BARTScore holistically measure the quality of generated text?

For example, can you report the s->h variant of BARTScore and say that, on the basis of that score, the quality of the summary generated by Model A is better than that of Model B?

Also, how do you decide which BARTScore variant to use for a particular dataset to measure COH, FLU, INFO, and FAC?

Please let me know.

yyy-Apple commented 2 years ago

Here are some rules we have followed when deciding which BARTScore variant to use.

However, we agree that designing a metric with multiple interpretable dimensions would be a promising direction for future work.