krystalan / chatgpt_as_nlg_evaluator

Technical Report: Is ChatGPT a Good NLG Evaluator? A Preliminary Study
https://arxiv.org/abs/2303.04048

score extraction heuristics #3

Closed UntotaufUrlaub closed 1 year ago

UntotaufUrlaub commented 1 year ago

Hi,

Could you please share the heuristics you used to convert ChatGPT's free-text responses into numerical scores? This would improve accessibility and reproducibility, and would make different meta-evaluation results easier to compare.

Kind regards.

krystalan commented 1 year ago
import re

def extract_stars_from_sentence(s):
    # Parse responses of the form "<score> star(s) ...", where the score
    # may be a digit or a spelled-out number word.
    try:
        res = s.split(' ')
        assert res[1].startswith('star'), s
        score = res[0].lower()

        if score in ['1', '2', '3', '4', '5']:
            return int(score)
        else:
            mapping = {
                'one': 1,
                'two': 2,
                'three': 3,
                'four': 4,
                'five': 5
            }
            assert score in mapping
            return mapping[score]
    except (AssertionError, IndexError):
        # Fall back to the minimum score for unparseable answers.
        return 1

def extract_scores_from_sentence(s):
    # Take the first number that appears anywhere in the response.
    res = re.findall(r'\d+', s)
    try:
        return int(res[0])
    except IndexError:
        # Fall back to 0 when the response contains no number at all.
        return 0
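[Editor's note] A quick sanity check of the heuristics above, restated here as a standalone snippet (with `import re` added so it runs on its own) and applied to a few hypothetical ChatGPT-style responses:

```python
import re

def extract_stars_from_sentence(s):
    # "<score> star(s) ..." where score is a digit or a number word.
    try:
        res = s.split(' ')
        assert res[1].startswith('star')
        score = res[0].lower()
        if score in ['1', '2', '3', '4', '5']:
            return int(score)
        mapping = {'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}
        assert score in mapping
        return mapping[score]
    except (AssertionError, IndexError):
        return 1  # fallback for unparseable answers

def extract_scores_from_sentence(s):
    # First number found anywhere in the response.
    res = re.findall(r'\d+', s)
    try:
        return int(res[0])
    except IndexError:
        return 0  # fallback when no number appears

print(extract_stars_from_sentence('4 stars'))          # 4
print(extract_stars_from_sentence('Three stars'))      # 3
print(extract_stars_from_sentence('Sounds fine.'))     # 1 (fallback)
print(extract_scores_from_sentence('90 out of 100'))   # 90
print(extract_scores_from_sentence('no number here'))  # 0 (fallback)
```

Note that both heuristics silently map failures to a fixed value (1 star or a score of 0), which is relevant to the follow-up question below.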
UntotaufUrlaub commented 1 year ago

Thank you very much for the fast answer!

I am closing the issue since the original question is answered, but I would be grateful if you could share your thoughts on the following question.

Are you concerned that the default values assigned to unparseable answers might bias the metric or distort the evaluation?
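[Editor's note] A toy illustration of this concern, using entirely hypothetical numbers: defaulting every unparseable star answer to 1 pulls the mean score toward the minimum, compared with simply excluding those answers.

```python
# Hypothetical example: four parsed star ratings plus two answers the
# heuristic could not parse, which each receive the default score of 1.
parsed = [4, 5, 3, 4]
n_unparseable = 2

with_default = (sum(parsed) + 1 * n_unparseable) / (len(parsed) + n_unparseable)
dropped = sum(parsed) / len(parsed)

print(with_default)  # 3.0 -> defaults of 1 drag the mean down
print(dropped)       # 4.0 -> mean over parseable answers only
```

The more often the model's answers fail to parse, the larger this gap grows, so the parse-failure rate itself would be worth reporting alongside the correlation numbers.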