songhaoyu / BoB

The released codes for ACL 2021 paper 'BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data'
https://aclanthology.org/2021.acl-long.14/
Apache License 2.0

How to calculate and evaluate the ppl of D1 and D2. #2

Closed: iyo-0713 closed this issue 3 years ago

iyo-0713 commented 3 years ago

First of all, I would like to know how to calculate the PPL of D1 and D2.

I also have a question about how to interpret the D1 and D2 PPL values. In "How to Run", the following is written:

Empirically, in the PersonaChat experiment with default hyperparameter settings, the best-performing checkpoint should be found between epoch 5 and epoch 9. If the training procedure goes fine, there should be some results like: Perplexity on test set is 21.037 and 7.813. where 21.037 is the ppl from the first decoder and 7.813 is the final ppl from the second decoder. And the generated results is redirected to test_result.tsv, here is a generated example from the above checkpoint:

However, as the number of epochs increases, the D2 PPL keeps decreasing, and by epoch 49 it drops to 1.957. (My result at epoch 7 was "Perplexity on test set is 27.675 and 22.045.", so my epoch-49 value may differ from other people's.) Admittedly, at epoch 49 the D1 PPL worsened to 249.0. However, as long as the final output of the model, the D2 score, improves, we don't need to worry about the D1 score. Please tell us why you decided that epoch 7 ("Perplexity on test set is 21.037 and 7.813.") is optimal.

haoyusoong commented 3 years ago

Have you taken a look at the generated results?

haoyusoong commented 3 years ago

The ppl calculation: https://huggingface.co/transformers/perplexity.html
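In case it helps, here is a minimal sketch of that computation (PPL as the exponential of the average negative log-likelihood over the non-padding target tokens), following the linked guide. This is not the exact evaluation code in this repository; the function name and the padding-id argument are placeholders for illustration.

```python
# Minimal sketch of perplexity from token-level cross-entropy, following the
# linked Hugging Face guide; NOT the exact evaluation code in this repo.
import math
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, labels: torch.Tensor, pad_id: int) -> float:
    """PPL = exp(mean negative log-likelihood over non-padding target tokens)."""
    # logits: (batch, seq_len, vocab_size); labels: (batch, seq_len)
    nll = F.cross_entropy(
        logits.view(-1, logits.size(-1)),  # flatten to (batch * seq_len, vocab_size)
        labels.view(-1),                   # flatten to (batch * seq_len,)
        ignore_index=pad_id,               # exclude padding positions from the average
        reduction="mean",
    )
    return math.exp(nll.item())
```

Presumably the same computation is applied once to the logits of the first decoder (D1) and once to those of the second decoder (D2), which is why the evaluation prints two numbers.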

haoyusoong commented 3 years ago

However, as long as the final output of the model, the D2 score, improves, we don't need to worry about the D1 score. Please tell us why you decided that epoch 7 ("Perplexity on test set is 21.037 and 7.813.") is optimal.

PPL is just one of the indicators; there are many other metrics. Our goal is to generate good dialogue responses rather than to obtain an extremely low PPL. Checkpoints beyond epoch 15 are usually overfitted on the PPL metric and suffer a significant drop in response quality. In our test run, epoch 7 delivers good responses and competitive performance on all metrics, including a relatively good PPL (cf. the baselines).

iyo-0713 commented 3 years ago

I see. I understand now. Thank you very much for answering my question.

Wenze7 commented 2 years ago

Hi bro, I would like to ask: have you been able to reproduce the results reported in the paper?

iyo-0713 commented 2 years ago

Hi bro. I could not reproduce the results reported in the paper.

Wenze7 commented 2 years ago

That's too bad. I tried to contact the author but never received a reply. Do you know of any other papers that use NLI and whose results can be reproduced?

iyo-0713 commented 2 years ago

Hmm... My friend also used this model but couldn't reproduce the results. I think it is difficult to reproduce the results in this paper, so I switched to a different model for my research. I don't know of any other models that use NLI.

Wenze7 commented 2 years ago

Fine, thanks!