neulab / BARTScore

BARTScore: Evaluating Generated Text as Text Generation

Meaning of the value range? #26

Closed · GabrielLin closed this 2 years ago

GabrielLin commented 2 years ago

Thank you for your work. Could you please tell us what the corresponding description of the value range is? For example, something like [-1, -3]: similar; [-3, -6]: normal; [-6, -100]: not similar. Thanks.

yyy-Apple commented 2 years ago

Hi, we don't set clear boundaries for similarity and dissimilarity.
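For context, BARTScore is the average log-likelihood of the target tokens given the source under a BART model, so the values are negative and have no fixed lower bound, which is why there are no universal cutoffs. A minimal sketch following the usage shown in this repo's README (the printed numbers will vary with checkpoint and inputs):

```python
# Sketch based on the BARTScorer usage in this repo's README.
from bart_score import BARTScorer

bart_scorer = BARTScorer(device='cuda:0', checkpoint='facebook/bart-large-cnn')

# Scores closer to 0 mean the target is more likely given the source;
# more negative scores mean less likely. There is no universal threshold.
scores = bart_scorer.score(['This is interesting.'],
                           ['This is fun.'],
                           batch_size=4)
print(scores)  # a list of negative floats, e.g. roughly in the -2 to -4 range here
```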

GabrielLin commented 2 years ago

Hi @yyy-Apple, thank you for your reply. That is exactly my concern: I run BARTScore and get a result, but I do not know what the score means. Could you please give us some examples if possible?

neubig commented 2 years ago

Hi @GabrielLin, I unfortunately don't think this is possible, as the good and bad values of BARTScore depend on the experimental setting. However, this is a common problem with evaluation metrics, not just BARTScore. For example, BLEU score doesn't really have a consistent "good" or "bad" value either; a translation system with a BLEU score of 20 could be either very bad or quite good depending on the evaluation dataset. So you should probably look at the outputs you're evaluating and form an idea of what values are good or bad for your particular dataset.
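One way to act on this advice is to calibrate the scores on your own data: score a few outputs you judge to be good and a few you judge to be bad, and use the resulting ranges as a rough yardstick. A hypothetical sketch (the example texts below are illustrative and not from the repo; only the `BARTScorer` API follows the README):

```python
from bart_score import BARTScorer

bart_scorer = BARTScorer(device='cuda:0', checkpoint='facebook/bart-large-cnn')

# Hypothetical calibration pairs you have already judged to be good or bad.
source    = ['The cat sat on the mat.']
good_hyps = ['A cat was sitting on the mat.']
bad_hyps  = ['Stock prices rose sharply on Tuesday.']

good_scores = bart_scorer.score(source, good_hyps, batch_size=4)
bad_scores  = bart_scorer.score(source, bad_hyps, batch_size=4)

# The gap between the two ranges gives a dataset-specific sense of scale;
# the absolute values themselves are not comparable across datasets.
print('good:', good_scores)
print('bad: ', bad_scores)
```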

GabrielLin commented 2 years ago

Hi @neubig, you are right. The reason I asked this question is that I am not familiar with BARTScore. I agree with you: as a metric, BARTScore should be treated like other metrics such as BLEU. I will work out some examples myself. Thank you.