ictnlp / DialoFlow

Code for ACL 2021 main conference paper "Conversations Are Not Flat: Modeling the Dynamic Information Flow across Dialogue Utterances".

Clarification on post-processing generated result #16


SuvodipDey commented 2 years ago

Hi. Kudos for this nice work. I am trying to reproduce the results on the DailyDialog dataset, and it would be very helpful if you could clarify the following details. In Issue #13, you mentioned using "nltk.word_tokenize() to tokenize the sentence and then concatenate the tokens" to make the format of the generated dialogue the same as that of the reference responses. I have two questions:

  1. Did you use any post-processing on the reference files?
  2. Did you try only nltk.word_tokenize(), or did you experiment with other tokenizers as well?

It would also be very helpful if you could briefly describe your post-processing steps.

lizekang commented 2 years ago

Hi, sorry for the late response. We use the multi-reference DailyDialog dataset.

  1. For the reference files, we only lowercase the words.
  2. We checked the reference files and found that nltk.word_tokenize() produces output matching their format.

If you have any questions, please feel free to ask.
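Concretely, the post-processing described above amounts to something like the following sketch. The file names are placeholders for illustration, not part of the DialoFlow repo:

```python
import nltk

nltk.download("punkt", quiet=True)  # tokenizer data required by word_tokenize

def postprocess_hypothesis(line):
    """Tokenize a generated response with NLTK and rejoin the tokens
    with single spaces so it matches the reference-file format."""
    return " ".join(nltk.word_tokenize(line.strip()))

def postprocess_reference(line):
    """References are only lowercased, as described above."""
    return line.strip().lower()

# Hypothetical file names, for illustration only.
with open("hyp.txt") as fin, open("hyp.post.txt", "w") as fout:
    for line in fin:
        fout.write(postprocess_hypothesis(line) + "\n")
```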

SuvodipDey commented 2 years ago

I got the following results on the DailyDialog dataset with the default settings. I fine-tuned the pre-trained DialoFlow models and generated outputs with a beam size of 5, followed by the NLTK tokenization step (a sketch of the decoding setup follows the results below).

| Metric | Base | Medium | Large |
| --- | --- | --- | --- |
| BLEU-1/2/3/4 | 47.52 / 25.18 / 14.99 / 9.50 | 48.75 / 26.60 / 16.16 / 10.44 | 48.21 / 26.36 / 16.08 / 10.47 |
| NIST-1/2/3/4 | 3.0337 / 3.5533 / 3.6926 / 3.7344 | 3.1213 / 3.6897 / 3.8463 / 3.8946 | 3.1413 / 3.7197 / 3.8753 / 3.9219 |
| METEOR | 15.80 | 16.30 | 16.24 |
| Entropy-1/2/3/4 | 5.1240 / 7.8786 / 9.1208 / 9.7618 | 5.1447 / 7.9415 / 9.2093 / 9.8437 | 5.2765 / 8.1516 / 9.4174 / 10.0041 |
| Div-1/2 | 0.0353 / 0.1907 | 0.0381 / 0.2019 | 0.0403 / 0.2209 |
| Avg. length | 12.01 | 12.05 | 11.99 |
| Best epoch | 23 | 9 | 5 |
| Validation loss at best epoch | 2.2757 / 0.0629 / 5.1641 | 2.1609 / 0.0595 / 5.1356 | 2.1056 / 0.0741 / 5.1261 |
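For completeness, the decoding setup is equivalent in spirit to the sketch below. It uses a plain GPT-2 checkpoint via Hugging Face transformers as a stand-in; DialoFlow ships its own model class and generation script, and the prompt formatting here is an assumption:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Stand-in checkpoint; in practice this would be the fine-tuned DialoFlow model.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Assumed prompt format: dialogue history terminated by the EOS token.
history = "hello ! how are you doing today ?"
input_ids = tokenizer.encode(history + tokenizer.eos_token, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=40,
        num_beams=5,            # beam size 5, as described above
        early_stopping=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# Strip the prompt and decode only the generated continuation.
response = tokenizer.decode(output_ids[0, input_ids.shape[-1]:],
                            skip_special_tokens=True)
print(response)
```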

There is a small gap between these numbers and the reported results, especially for BLEU, NIST, and METEOR. Could you please help me figure out the source of this discrepancy?
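On the metric side, one sanity check is to confirm that BLEU is being computed against all references per test example, since the evaluation here is multi-reference. A minimal multi-reference BLEU computation with NLTK looks like the sketch below; the toy data is illustrative, and this will not exactly reproduce the numbers from the repo's evaluation script:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Toy data: each hypothesis is a token list; each test example has a
# *list* of tokenized references (multi-reference evaluation).
hyps = [["i", "am", "fine", ",", "thanks", "."]]
refs = [[["i", "'m", "fine", "."],
         ["i", "am", "doing", "well", ",", "thanks", "."]]]

smooth = SmoothingFunction().method1
for n in range(1, 5):
    # Uniform weights over 1..n-grams give BLEU-n.
    weights = tuple(1.0 / n for _ in range(n))
    score = corpus_bleu(refs, hyps, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {100 * score:.2f}")
```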