iyo-0713 closed 3 years ago
Have you ever taken a look at the generated results?
The ppl calculation: https://huggingface.co/transformers/perplexity.html
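As a rough sketch of the idea on that page: perplexity is the exponential of the mean negative log-likelihood per target token. The snippet below assumes you already have per-token log-probabilities for the reference response. The function name `perplexity` and the example values are hypothetical and not taken from this repository's code. The d1 and d2 values would presumably come from applying the same formula to each decoder's token log-probabilities separately.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over the given tokens.

    token_log_probs: natural-log probabilities the model assigned to each
    reference token (hypothetical input, not this repo's data format).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical example: four target tokens, each assigned probability 0.25.
# A model that is uniformly unsure among 4 choices has a perplexity of about 4.
uniform = [math.log(0.25)] * 4
print(perplexity(uniform))
```

A lower value means the model assigns higher probability to the reference tokens. On its own it says nothing about how good the sampled responses are.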
However, as long as the final output of the model, the d2 score, improves, we don't need to worry about the d1 score. Please tell us why you decided that epoch 7 (Perplexity on test set is 21.037 and 7.813.) is optimal.
PPL is just one of the indicators, and there are many other metrics. Our goal is to generate good dialogue responses rather than to get an extremely low ppl. Checkpoints beyond epoch 15 usually overfit to the ppl metric and suffer a significant drop in response quality. In our test run, epoch 7 delivers good responses and is competitive on all metrics, including a relatively good ppl (cf. baselines).
I see. I understand now. Thank you very much for answering my question.
Hi, bro, I would like to ask: have you reproduced the results mentioned in the paper?
Hi, bro. I could not reproduce the results mentioned in the paper.
That's too bad. I tried to contact the author, but I never received a reply. Do you know of any papers that use NLI and whose results can be reproduced?
Uhm... My friend also used this model, but couldn't reproduce the result. I think it is difficult to reproduce the result in this paper. I switched to a different model for my research. I don't know of other models that use NLI.
Fine, thanks!
First of all, I would like to know how to calculate the ppl of d1 and d2.
And I also have a question about how to evaluate the values of d1 and d2 ppl. In "How to Run", it was written as follows.
Empirically, in the PersonaChat experiment with default hyperparameter settings, the best-performing checkpoint should be found between epoch 5 and epoch 9. If the training procedure goes fine, there should be some results like:
Perplexity on test set is 21.037 and 7.813.
where 21.037 is the ppl from the first decoder and 7.813 is the final ppl from the second decoder. And the generated results is redirected to test_result.tsv, here is a generated example from the above checkpoint:
However, as the number of epochs increases, the d2 ppl decreases, and in epoch 49 it drops to 1.957. (My result for epoch 7 was "Perplexity on test set is 27.675 and 22.045.", so it may be different from other people's value for epoch 49.) Indeed, in epoch 49, the d1 ppl worsened to 249.0. However, as long as the final output of the model, the d2 score, improves, we don't need to worry about the d1 score. Please tell us why you decided that epoch 7 (Perplexity on test set is 21.037 and 7.813.) is optimal.
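To compare checkpoints side by side, one option is to pull both numbers out of the log line for each epoch. This is only a sketch: the helper `parse_ppl` is hypothetical, the regular expression assumes the exact wording "Perplexity on test set is X and Y." quoted above, and the epoch 49 line is reconstructed from the two values I reported rather than copied from a log.

```python
import re

# Matches the log line quoted from "How to Run"; the two groups are the
# first-decoder (d1) and second-decoder (d2) perplexities.
PPL_LINE = re.compile(
    r"Perplexity on test set is (\d+(?:\.\d+)?) and (\d+(?:\.\d+)?)\."
)


def parse_ppl(line):
    """Return (d1_ppl, d2_ppl) as floats, or None if the line does not match."""
    match = PPL_LINE.search(line)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


# Epoch numbers are supplied by hand here; the log format for them is unknown.
logs = {
    7: "Perplexity on test set is 27.675 and 22.045.",
    49: "Perplexity on test set is 249.0 and 1.957.",  # reconstructed line
}

for epoch in sorted(logs):
    d1, d2 = parse_ppl(logs[epoch])
    print(f"epoch {epoch:>2}: d1 ppl = {d1:.3f}, d2 ppl = {d2:.3f}")
```

Printing both columns makes the pattern in my runs easy to see: d2 keeps falling while d1 gets much worse.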