jinlanfu / GPTScore

Source Code of Paper "GPTScore: Evaluate as You Desire"

About the Evaluation of Dialogue Generation #3

Open gmftbyGMFTBY opened 1 year ago

gmftbyGMFTBY commented 1 year ago

GPTScore presents very thorough experimental results for generation-based evaluation across many downstream NLG tasks. Thank you so much for your work.

Recently, I have also noticed that large-scale language models may become a universal and powerful evaluation method, and I have run some experiments on meta-evaluation benchmarks for the dialogue generation task, for example, the Empathetic-Eval benchmark mentioned in MDD-Eval.

However, I notice that GPT-3 and other publicly available large-scale language models show very limited correlation with human judgments (Pearson and Spearman scores). I also notice that you only report experimental results on the FED-turn and FED-dialog meta-evaluations. Have you observed similar results on other meta-evaluation benchmarks (besides FED-turn and FED-dialog)?

Looking forward to your response.

ddehun commented 1 year ago

Thank you for sharing your awesome work, @jinlanfu! In addition to the original question by @gmftbyGMFTBY, I have a question about the evaluation protocol for the dialogue response evaluation task. As described in Appendix D, the dialogue evaluation task is formulated in a question-answering format, unlike the other evaluation tasks (e.g., data-to-text). In this case, are only the probabilities of the "Yes" and "No" tokens considered when computing the metric score? Thank you in advance.

jinlanfu commented 1 year ago

Thanks for your interest in our work. We use the probability of generating 'Yes' as the score (without considering the generation of 'No').
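
For readers wanting to see what this scoring looks like in practice, here is a minimal sketch of computing the probability of "Yes" as the next token after a QA-style evaluation prompt. It uses a Hugging Face causal LM as a stand-in (the paper queries GPT-3 via the OpenAI API); the model name, prompt wording, and `yes_probability` helper are illustrative assumptions, not the authors' code.

```python
# Sketch only: score a dialogue response by P("Yes") as the next token
# after a QA-formatted evaluation prompt. Model and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper uses GPT-3 through the OpenAI API
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def yes_probability(dialog_history: str, response: str, aspect_question: str) -> float:
    """Return the probability of 'Yes' as the next token after the prompt."""
    prompt = (
        f"Conversation:\n{dialog_history}\n"
        f"Response: {response}\n"
        f"{aspect_question} Answer Yes or No.\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]       # next-token logits
    probs = torch.softmax(logits, dim=-1)
    yes_id = tokenizer.encode(" Yes")[0]             # first token of " Yes" (leading space matters)
    return probs[yes_id].item()                      # only P("Yes"); P("No") is not used

score = yes_probability("A: How are you?\nB: I'm fine.",
                        "Glad to hear that!",
                        "Is the response engaging?")
print(score)
```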

Hyfred commented 6 months ago

> Thanks for your interest in our work. We use the probability of generating 'Yes' as the score (without considering the generation of 'No').

Thanks for your impressive experiments. I was wondering why you turned the dialogue evaluation task into a question-answering task instead of just using the same form as the other three tasks.

SkAndMl commented 1 month ago

> Thanks for your interest in our work. We use the probability of generating 'Yes' as the score (without considering the generation of 'No').

So you do not consider the average of the log probs of the whole string?
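
For contrast with the 'Yes'-probability scoring above, here is a minimal sketch of the average-log-probability formulation being asked about (the form used for the other tasks): score the response by the mean log probability of its tokens given the evaluation prompt. Same placeholder model and prompt assumptions as before; this is illustrative, not the authors' implementation.

```python
# Sketch only: average per-token log probability of the full response string
# conditioned on an instruction/context prefix, using a Hugging Face causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def avg_logprob_score(prefix: str, hypothesis: str) -> float:
    """Mean log probability of the hypothesis tokens given the prefix."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    hyp_ids = tokenizer(hypothesis, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, hyp_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # log prob of each token, predicted from the preceding position
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    hyp_positions = range(prefix_ids.size(1) - 1, input_ids.size(1) - 1)
    token_lps = [log_probs[pos, input_ids[0, pos + 1]].item() for pos in hyp_positions]
    return sum(token_lps) / len(token_lps)

print(avg_logprob_score(
    "Evaluate the coherence of the response. Conversation: ... Response:",
    " Glad to hear that!"))
```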