nlpyang / geval

Code for paper "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment"
MIT License

Scoring function clarification #6

Closed: yashashree-k closed this 8 months ago

yashashree-k commented 8 months ago

Hello, I was reading your paper and had a question regarding scoring. I looked through the code in this repo but did not find the implementation for the scoring function.

According to the paper, given a set of scores predefined in the prompt (e.g., 1 to 5), S = {S1, S2, ..., Sn}, the LLM computes the probability p(S[i]) of each score, and the final score = sum(p(S[i]) * S[i]) for i = 1 to n.

In theory I understand this approach, but in practice how would one go about implementing it? Since the prompts follow a form-filling paradigm, the response contains only the integer rating the LLM assigned to the summary text, so we only have access to the logprob of that specific score (i.e., only 1 of the n values). How would we access the logprobs of score tokens that are not present in the LLM response (i.e., the remaining scores in the summation)?

Please let me know if I'm missing something. Thanks in advance!
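For concreteness, the weighted sum described in the paper can be worked through with made-up numbers (the probabilities here are purely illustrative, not real model output):

```python
scores = [1, 2, 3, 4, 5]
probs = [0.02, 0.05, 0.13, 0.60, 0.20]  # hypothetical p(S[i]) from the LLM

# Final score is the expectation of the score under these probabilities.
final = sum(p * s for p, s in zip(probs, scores))
print(final)  # ≈ 3.91
```

So a model that mostly answers "4" but places some mass on "5" yields a continuous score between the two, which is the granularity the paper is after.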

Peilun-Li commented 8 months ago

I think it's possible with the latest OpenAI API (https://platform.openai.com/docs/api-reference/chat/create) by setting both logprobs and top_logprobs in the request.

But yeah, I do wonder the same thing. It seems there are (at least) two ways to interpret the paper's idea of "using the probabilities of output tokens from LLMs to normalize the scores and take their weighted summation as the final results":

  1. The logprob way
  2. Ask the model to generate multiple completions (the n parameter in the OpenAI API), then take an average of the scores.
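A sketch of the first (logprob) way, assuming the (token, logprob) pairs for the score token have already been extracted from a Chat Completions response; the parsing and the `valid_scores` tuple are illustrative assumptions, not the repo's actual code:

```python
import math

def expected_score(top_logprobs, valid_scores=("1", "2", "3", "4", "5")):
    """Weighted sum of scores using the model's token probabilities.

    top_logprobs: (token, logprob) pairs for the generated score token,
    as they would appear in a Chat Completions logprobs payload.
    Probabilities are renormalized over the valid score tokens only,
    since the top-k list doesn't cover the full vocabulary.
    """
    probs = {}
    for token, logprob in top_logprobs:
        tok = token.strip()
        if tok in valid_scores:
            # Merge token variants like "4" and " 4" into one bucket.
            probs[tok] = probs.get(tok, 0.0) + math.exp(logprob)
    total = sum(probs.values())
    if total == 0.0:
        raise ValueError("no valid score token found in top_logprobs")
    return sum(int(tok) * p / total for tok, p in probs.items())

# Illustrative logprobs, not real model output:
print(expected_score([("4", math.log(0.6)),
                      ("5", math.log(0.3)),
                      ("3", math.log(0.1))]))  # ≈ 4.2
```

The renormalization step matters because top_logprobs only returns the top-k tokens, so the probabilities over the score tokens alone won't sum to 1.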

It looks like the code implements the second way with n=20, but I also see a commented-out usage of logprobs there; that probably dates from when OpenAI had disabled logprobs (before later re-enabling it).
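And the second way might look like this, with the sampled responses mocked rather than coming from an actual n=20 API call (the parsing regex is an assumption for illustration):

```python
import re

def average_score(responses, lo=1, hi=5):
    """Average the integer ratings parsed from n sampled completions."""
    parsed = []
    for text in responses:
        m = re.search(r"\b(\d+)\b", text)
        if m and lo <= int(m.group(1)) <= hi:
            parsed.append(int(m.group(1)))
    if not parsed:
        raise ValueError("no parseable score in any sampled response")
    return sum(parsed) / len(parsed)

# In practice `responses` would be the 20 choices returned by a single
# Chat Completions call with n=20; they are mocked here.
print(average_score(["4", "5", "4", "3", "4"]))  # → 4.0
```

With enough samples the empirical average approximates the same expectation as the logprob way, but at the cost of 20 completions per evaluation and with discretization noise at small n.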

yashashree-k commented 8 months ago

Thanks for your reply @Peilun-Li! That answers most of my questions. I'd be curious to hear why the authors chose sampling; I think it makes sense if we want to verify the reliability of the discrete scores for the same inputs. For the purpose of obtaining a continuous score with enough granularity to differentiate between results, I think using top_logprobs is sufficient to implement the function as described in the paper.