Open aman229 opened 4 years ago
Hi, thanks for your interest in this repo. Compared with the results in the original DailyDialog paper, the BLEU-1/2 scores here are lower, but the BLEU-3/4 scores are much better. In my opinion, BLEU-3/4 are more informative than BLEU-1/2 because they reward longer matching n-grams, which suggests that the model generates more fluent responses. So I think the results are acceptable. If you are still unsure about it, feel free to contact me.
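To make the BLEU-1/2 versus BLEU-3/4 point concrete, here is a minimal sketch of the clipped n-gram precision that BLEU-n is built on. It is not the evaluation code from this repo: the function names and the toy sentences are made up for illustration, and full BLEU additionally combines the precisions with a geometric mean and a brevity penalty.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(hypothesis, reference, n):
    """Clipped n-gram precision of a hypothesis against a single reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / total


# Hypothetical toy data: both hypotheses use the same words, in different order.
reference = "i would like a cup of coffee please".split()
in_order = "i would like a cup of tea".split()
shuffled = "cup a like would i of tea".split()

for n in (1, 4):
    print(n, modified_precision(in_order, reference, n),
          modified_precision(shuffled, reference, n))
```

Both hypotheses get the same unigram precision, but only the one with the correct word order keeps a nonzero 4-gram precision. That is the sense in which higher-order BLEU is more sensitive to fluency. Note also that toolkits differ in tokenization and smoothing, so BLEU numbers from different implementations are not always directly comparable.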
Hi, I tried running the Seq2Seq and HRED models on the DailyDialog dataset. Here are the results I got:
Model: Seq2Seq
BLEU-1: 0.215; BLEU-2: 0.0986; BLEU-3: 0.057; BLEU-4: 0.0366
ROUGE: 0.0492
Distinct-1: 0.0268; Distinct-2: 0.131
Ref distinct-1: 0.0599; Ref distinct-2: 0.3644
BERTScore: 0.1414

Model: HRED
BLEU-1: 0.2121; BLEU-2: 0.0961; BLEU-3: 0.0542; BLEU-4: 0.0331
ROUGE: 0.0502
Distinct-1: 0.0208; Distinct-2: 0.0992
Ref distinct-1: 0.0588; Ref distinct-2: 0.3619
BERTScore: 0.1436
These results seem to be much lower than the ones reported in the DailyDialog paper: https://www.aclweb.org/anthology/I17-1099.pdf

Do you have any idea why that is the case? Thanks!
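For readers unfamiliar with the Distinct-1/2 figures above: they are commonly defined as the number of unique n-grams divided by the total number of n-grams across all generated responses. The sketch below follows that common definition. It is an assumption that the repo computes the metric the same way, and `distinct_n` and the toy responses are hypothetical names and data.

```python
def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams over a list of tokenized responses."""
    total = 0
    unique = set()
    for tokens in responses:
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0


# Hypothetical toy data: a repeated generic reply lowers the score.
responses = [
    "i do not know".split(),
    "i do not know".split(),
    "that sounds great".split(),
]

print(distinct_n(responses, 1))
print(distinct_n(responses, 2))
```

Under this definition, a model that repeats the same generic replies scores lower than the reference responses, which matches the gap between the Distinct and Ref distinct rows reported above.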