gmftbyGMFTBY / MultiTurnDialogZoo

Multi-turn dialogue baselines written in PyTorch
MIT License

Performance on DailyDialog dataset #8

Open aman229 opened 4 years ago

aman229 commented 4 years ago

Hi, I tried running the Seq2Seq and HRED models on the DailyDialog dataset. Here are the results I got:

| Metric | Seq2Seq | HRED |
|---|---|---|
| BLEU-1 | 0.215 | 0.2121 |
| BLEU-2 | 0.0986 | 0.0961 |
| BLEU-3 | 0.057 | 0.0542 |
| BLEU-4 | 0.0366 | 0.0331 |
| ROUGE | 0.0492 | 0.0502 |
| Distinct-1 | 0.0268 | 0.0208 |
| Distinct-2 | 0.131 | 0.0992 |
| Ref Distinct-1 | 0.0599 | 0.0588 |
| Ref Distinct-2 | 0.3644 | 0.3619 |
| BERTScore | 0.1414 | 0.1436 |
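For context, Distinct-n is the ratio of unique n-grams to total n-grams over all generated replies, and Ref Distinct-n is the same statistic computed over the reference replies. A minimal sketch of that computation (the function name and interface are my own, not necessarily what this repo uses):

```python
def distinct_n(sentences, n):
    """Ratio of unique n-grams to total n-grams across all replies.

    sentences: list of token lists, one per generated (or reference) reply.
    Returns 0.0 when no n-grams exist (e.g. all replies shorter than n).
    """
    total = 0
    unique = set()
    for tokens in sentences:
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0
```

A low Distinct-1/2 relative to the reference replies (as in the table above) is the usual sign of generic, repetitive responses from these baselines.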

These results seem to be much lower than the ones reported in the DailyDialog paper: https://www.aclweb.org/anthology/I17-1099.pdf. Do you have any clues about why that is the case? Thanks!

gmftbyGMFTBY commented 4 years ago

Hi, thanks for your interest in this repo. Compared with the results in the original DailyDialog paper, the BLEU-1/2 scores are lower, but the BLEU-3/4 scores are much better. In my opinion, BLEU-3/4 are more suitable metrics than BLEU-1/2, since higher-order n-gram overlap better indicates that the model generates fluent responses. So I think the results are acceptable. If you are still confused, feel free to contact me.
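One thing worth checking when BLEU numbers differ across papers is the exact computation: tokenization, smoothing, and corpus-level vs. sentence-level averaging can all shift the scores noticeably. As a reference point, here is a minimal sentence-level BLEU sketch (my own implementation for illustration, not necessarily what this repo or the DailyDialog paper used):

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU with modified n-gram precision and brevity penalty.

    No smoothing: returns 0.0 if any n-gram precision up to max_n is zero,
    which is one reason unsmoothed BLEU-3/4 can look very low on short replies.
    """
    if not hypothesis:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return brevity * geo_mean
```

Because BLEU-3/4 requires longer exact n-gram matches, even small differences in tokenization or in how zero-count n-grams are smoothed can produce large relative gaps between two evaluations of the same model.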