gmftbyGMFTBY / MultiTurnDialogZoo

Multi-turn dialogue baselines written in PyTorch
MIT License

Performance on DailyDialog dataset #8

Open aman229 opened 4 years ago

aman229 commented 4 years ago

Hi, I tried running the Seq2Seq and HRED models on the DailyDialog dataset. Here are the results I got:

| Metric | Seq2Seq | HRED |
|---|---|---|
| BLEU-1 | 0.215 | 0.2121 |
| BLEU-2 | 0.0986 | 0.0961 |
| BLEU-3 | 0.057 | 0.0542 |
| BLEU-4 | 0.0366 | 0.0331 |
| ROUGE | 0.0492 | 0.0502 |
| Distinct-1 | 0.0268 | 0.0208 |
| Distinct-2 | 0.131 | 0.0992 |
| Ref Distinct-1 | 0.0599 | 0.0588 |
| Ref Distinct-2 | 0.3644 | 0.3619 |
| BERTScore | 0.1414 | 0.1436 |
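For context, Distinct-n is the ratio of unique n-grams to total n-grams over all generated replies, and Ref Distinct-n is the same statistic computed over the reference replies. A minimal sketch of that computation (the function name and interface are my own, not necessarily what this repo uses):

```python
def distinct_n(sentences, n):
    """Ratio of unique n-grams to total n-grams across all replies.

    sentences: list of token lists, one per generated (or reference) reply.
    Returns 0.0 when no n-grams exist (e.g. all replies shorter than n).
    """
    total = 0
    unique = set()
    for tokens in sentences:
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0
```

A low Distinct-1/2 relative to the reference replies (as in the table above) is the usual sign of generic, repetitive responses from these baselines.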

These results seem to be much lower than the ones reported in the DailyDialog paper: https://www.aclweb.org/anthology/I17-1099.pdf. Do you have any clues about why that is the case? Thanks!

gmftbyGMFTBY commented 4 years ago

Hi, thanks for your interest in this repo. Compared with the results in the original DailyDialog paper, the BLEU-1/2 scores are lower, but the BLEU-3/4 scores are much better. In my opinion, BLEU-3/4 are more suitable metrics than BLEU-1/2, since higher-order n-gram overlap better indicates that the model generates fluent responses. So I think the results are acceptable. If you are still confused, feel free to contact me.
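One thing worth checking when BLEU numbers differ across papers is the exact computation: tokenization, smoothing, and corpus-level vs. sentence-level averaging can all shift the scores noticeably. As a reference point, here is a minimal sentence-level BLEU sketch (my own implementation for illustration, not necessarily what this repo or the DailyDialog paper used):

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU with modified n-gram precision and brevity penalty.

    No smoothing: returns 0.0 if any n-gram precision up to max_n is zero,
    which is one reason unsmoothed BLEU-3/4 can look very low on short replies.
    """
    if not hypothesis:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return brevity * geo_mean
```

Because BLEU-3/4 requires longer exact n-gram matches, even small differences in tokenization or in how zero-count n-grams are smoothed can produce large relative gaps between two evaluations of the same model.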