baseline models - Githubissues

marvinzh commented 4 years ago

Hi, I find this repo quite helpful for the dialog system research community. I was wondering whether the baseline models provided in this repo are ready to use or still in development. Thank you!

gmftbyGMFTBY commented 4 years ago

Hi, thank you for your attention. 1) The baselines which are ready to use: HRED, VHRED, ReCoSa (MReCoSa), HRAN, WSeq, DSHRED, Seq2Seq-attn. The transformer models are still in development. If you can provide a Transformer (PyTorch version, and at least better than Seq2Seq-attn), I would really appreciate it.

2) About the performance: I'm running extensive experiments on these models and will report the performance soon. So far I only have partial observations on the DailyDialog dataset, and I will push these experimental results to this repo within a month. (I'm sorry, but GitHub does not seem to let me upload the .png file. Maybe I will show the results in this issue soon.)

3) Thank you for your help: I'm still a newbie in NLP for dialogue systems. I'm still struggling to finish the transformer-based baselines such as GPT2 or Transformer(Seq2Seq). If you're familiar with the transformer (PyTorch version) and can provide the Transformer code, I will be very thankful.

4) The code structure is unsatisfactory. I will clean it up in about a month.

Hope you have a good day.

marvinzh commented 4 years ago

Hi, thank you for your prompt and informative reply.

> The baselines which are ready to use: HRED, VHRED, ReCoSa (MReCoSa), HRAN, WSeq, DSHRED, Seq2Seq-attn.

It is great! I'll try it.

> I'm still struggling to finish the transformer-based baselines such as GPT2 or Transformer(Seq2Seq).

I have been working on applying transformer-based models to multi-turn dialog modeling for the past few months, and I am also implementing a toolkit for quick prototyping of multi-turn dialog models, focusing mainly on transformer-based models.

My current transformer implementation (in PyTorch) gets a decent score (34.4 BLEU, small setting) on IWSLT14-de-en compared with other implementations available on GitHub. I'll open-source it later, and it would be nice if it could be helpful to you. :)

Thank you.

gmftbyGMFTBY commented 4 years ago

Amazing, I cannot wait to learn from your code. Thank you so much!

gmftbyGMFTBY commented 4 years ago

Hi, have you applied your transformer model to a dialogue corpus to measure its performance? And when will you release the code? I'm very excited about it :)

Hope to get your response.

marvinzh commented 4 years ago

Hi, sorry for the late response.

Yesterday I made a mistake: the BLEU score obtained from my implementation is not ~~34.4~~, it's 33.4. I mixed it up with a score reported in other papers. Anyway, I have open-sourced my transformer implementation and data at

https://github.com/marvinzh/trs.git

AFAIK, it's quite tricky to train the transformer model. There are many factors (e.g. clip_norm, optimizer, lr, etc.) that can affect the performance even when the implementation is right. In my experiments, I tested my model on a single RTX2070 + cu101.
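
To make one of these factors concrete, here is a minimal pure-Python sketch of what a clip_norm setting usually does: rescale all gradients together when their joint L2 norm exceeds a threshold. The function name `clip_by_global_norm` and the toy gradients are illustrative assumptions, not code from either repo; in PyTorch the corresponding utility is `torch.nn.utils.clip_grad_norm_`.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total_norm <= max_norm or total_norm == 0.0:
        # Already within the limit: return an unchanged copy.
        return [list(vec) for vec in grads], total_norm
    scale = max_norm / total_norm
    return [[g * scale for g in vec] for vec in grads], total_norm

# Hypothetical gradients for two parameter tensors, flattened to lists.
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
```

Because every gradient is scaled by the same factor, the update direction is preserved and only its length changes, which is one reason the threshold can interact with the learning rate.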

Also, I'd like to ask whether you have evaluation scores on the DailyDialog dataset for the baseline models supported in your software. It would be great if you could share them with me! :)


gmftbyGMFTBY commented 4 years ago

Hi, first of all, thank you so much for open-sourcing your code. I think I will learn a lot from it.

Yes, I can share the results with you (the results are in TensorBoard, but the .png file cannot be uploaded to GitHub, so I am sharing partial results). For all the datasets (DailyDialog, EmpChat, PersonaChat, DSTC7-AVSD, Ubuntu), I use the following metrics for evaluation (human evaluation is hard to obtain, so I don't use it):

PPL
BLEU(1-4)
ROUGE (ROUGE-2)
Distinct-1/2
Embedding-Average, Vector-Extrema, Greedy-Matching
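
As a point of reference for the Distinct-1/2 entry, here is a minimal sketch of corpus-level distinct-n, i.e. the number of unique n-grams divided by the total number of n-grams over all generated responses. The function name `distinct_n` and the toy responses are assumptions for illustration; the thread does not say whether the repo pools n-grams over the corpus or averages per response.

```python
from collections import Counter

def distinct_n(responses, n):
    """Corpus-level distinct-n: unique n-grams / total n-grams over all responses."""
    ngrams = Counter()
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

# Toy generated responses, already tokenized (hypothetical data).
responses = [
    "i do not know".split(),
    "i do not think so".split(),
]
dist1 = distinct_n(responses, 1)  # 6 unique unigrams out of 9
dist2 = distinct_n(responses, 2)  # 5 unique bigrams out of 7
```

Pooling over the whole corpus and averaging per response generally give different numbers, so two implementations need to agree on this choice (and on tokenization) before their scores are compared.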

The results of the baselines on Dailydialog dataset are shown as follows:

| Models | PPL | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-2 | Dist-1 | Dist-2 | EA | |
|---|---|---|---|---|---|---|---|---|---|---|
| Seq2Seq | 28.69 | 0.2178 | 0.1015 | 0.0591 | 0.0382 | 0.0557 | 0.0244 | 0.1204 | 0.8738 | 0.8407 |
| HRED | 31.49 | 0.1971 | 0.0863 | 0.0458 | 0.0261 | 0.0443 | 0.0147 | 0.0711 | 0.868 | 0.833 |
| WSeq | 36.05 | 0.2006 | 0.0843 | 0.0431 | 0.0238 | 0.0385 | 0.0114 | 0.0710 | 0.8675 | 0.8333 |
| DSHRED | 37.59 | 0.2054 | 0.0922 | 0.0504 | 0.0301 | 0.0489 | 0.0185 | 0.0885 | 0.08705 | 0.836 |
| VHRED | 32.63 | 0.1828 | 0.0776 | 0.0401 | 0.0226 | 0.0387 | 0.0166 | 0.0698 | 0.8631 | 0.8257 |
| HRAN | 28.69 | 0.2263 | 0.1109 | 0.0678 | 0.0461 | 0.0631 | 0.0267 | 0.1371 | 0.8741 | 0.8407 |
| ReCoSa | 34.46 | 0.1911 | 0.0832 | 0.044 | 0.0251 | 0.0424 | 0.0124 | 0.0580 | 0.8649 | 0.8293 |
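
For readers comparing the PPL column, perplexity is conventionally the exponential of the mean per-token negative log-likelihood. The sketch below assumes natural-log losses averaged over tokens; whether this repo averages per token or per sentence is not stated in the thread, and the name `perplexity` and the toy losses are hypothetical.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))

# Hypothetical per-token losses (in nats): the model assigns probability 1/4 to every token.
nlls = [math.log(4.0)] * 10
ppl = perplexity(nlls)
```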

The responses generated by the models look good to me.

Yes, the transformer model is very hard to train, which is its fatal weakness. I also tried the implementation in OpenNMT-py to train the multi-turn dialogue systems. I'm glad to see that OpenNMT-py's transformer model is better than my models only on the DailyDialog dataset, and much worse than mine on the other four datasets. So I think the implementation in this repo is just fine.

@marvinzh, Hi, can you try the DailyDialog dataset with your trs repo and share the results with me? I really hate transformers right now, and I have already written a new baseline that has a transformer encoder and a GRU decoder (the transformer decoder does not work in my implementation). Maybe we can share the experimental results with each other?

marvinzh commented 4 years ago

Thank you for sharing the result.

Could you also share the ground-truth and generated response pairs for the models shown in the table above, so that I can test them in my environment? (Any format is OK; I can process it later.) I ask because some of the scores I obtained are quite different from yours. As for the embedding-based metrics, I noticed you use pre-trained GloVe embeddings while I'm using Google's word2vec, so I guess that might affect the results a little bit?

> Maybe we can share the experimental results with each other?

Sure! That would be great!

gmftbyGMFTBY commented 4 years ago

Hi, I tried to upload the generated file, but GitHub always raises an error (Something went really wrong, and we can't process that file), the same as with the .png file. (I don't get it.)

Maybe you can give me your email address and I can send it to you.

marvinzh commented 4 years ago

Sure, my mail is baiyuu.cs[AT]gmail.com, feel free to contact me!

gmftbyGMFTBY commented 4 years ago

The mail has been sent. If you have any questions, feel free to contact me. By the way, do you think that the transformer encoder and GRU decoder will work together?

marvinzh commented 4 years ago

Hi, Thank you for sharing the file!

> By the way, do you think that the transformer encoder and GRU decoder will work together?

Well, it's hard to say. In your case, the GRU decoder needs to generate each word conditioned on the output of the transformer encoder, while the encoder output is actually a sequence of contextual word embeddings. I'm sure you can eventually make it work, but I have doubts about the performance.

As for the file you sent me, is that the result for the HRED model? I tested it using BLEU-4, dist-n, and embedding-based similarity. Here are the scores in my environment. I found they generally align with yours, except for the embedding-based ones (that's totally fine).

bleu::  0.026060156595983073
distinct-1:  0.014626877377357751
distinct-2:  0.07118675084637203
average:  0.40945895230457285
greedy:  0.3217714133217234
extrema:  0.29181506281867986
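
For context on the last three scores, here is a minimal sketch of the three embedding-based similarities as they are commonly defined: cosine of averaged word vectors, cosine of per-dimension extrema vectors, and symmetric greedy word matching. The 2-d toy embeddings and all function names are assumptions for illustration; real runs would load GloVe or word2vec vectors, and neither author's exact implementation is shown in the thread.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def average_vec(vectors):
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def extrema_vec(vectors):
    # Per dimension, keep the value with the largest magnitude.
    dim = len(vectors[0])
    return [max((v[d] for v in vectors), key=abs) for d in range(dim)]

def greedy_match(a, b):
    # Average, over words in a, of the best cosine match among words in b.
    return sum(max(cosine(x, y) for y in b) for x in a) / len(a)

def embedding_scores(ref_tokens, hyp_tokens, emb):
    # Out-of-vocabulary tokens are skipped (an assumption of this sketch).
    ref = [emb[t] for t in ref_tokens if t in emb]
    hyp = [emb[t] for t in hyp_tokens if t in emb]
    if not ref or not hyp:
        return {"average": 0.0, "extrema": 0.0, "greedy": 0.0}
    return {
        "average": cosine(average_vec(ref), average_vec(hyp)),
        "extrema": cosine(extrema_vec(ref), extrema_vec(hyp)),
        "greedy": (greedy_match(ref, hyp) + greedy_match(hyp, ref)) / 2,
    }

# Toy 2-d embeddings (hypothetical values).
emb = {"good": [1.0, 0.0], "fine": [0.9, 0.1], "day": [0.0, 1.0]}
scores = embedding_scores(["good", "day"], ["fine", "day"], emb)
identical = embedding_scores(["good", "day"], ["good", "day"], emb)
```

Since all three scores depend directly on the vectors, switching between GloVe and word2vec can shift them even when the generated text is identical.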

Also, I would like to confirm what data the file you sent me contains: is it the test set of DailyDialog?

FYI, today I also tried another open-source implementation of HRED, which gives 0.074099 and 0.333950 on dist-1 and dist-2, respectively. It's a little bit weird that all of your models have low dist-n scores.

marvinzh commented 4 years ago

Hi, do you mind if we sync our code for evaluating the generated responses? It would make our results directly comparable. If that's OK, I can create a repo so that we can share and sync our implementations of the different metrics.

gmftbyGMFTBY commented 4 years ago

> Also, I would like to confirm what data the file you sent me contains: is it the test set of DailyDialog?
>
> FYI, today I also tried another open-source implementation of HRED, which gives 0.074099 and 0.333950 on dist-1 and dist-2, respectively. It's a little bit weird that all of your models have low dist-n scores.

1) Yes, it's the test set of DailyDialog. I also calculated the ground-truth dist-1 and dist-2; the results are 0.0577 and 0.3594. So that high distinct score confuses me too.

> Hi, do you mind if we sync our code for evaluating the generated responses? It would make our results directly comparable.

2) Hi, my evaluation metrics are saved in the metric folder, and eval.py shows how to use them. By the way, I will also try Google's word2vec.

I forgot something: I tried OpenNMT-py's transformer on DailyDialog, and the results are as follows:

BLEU(1/2/3/4): 0.2258/0.1505/0.1281/0.1183
ROUGE: 0.1246
Dist-1: 0.0583
Dist-2: 0.2479
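
Since BLEU numbers in this thread differ a lot between implementations, here is a minimal sketch of unsmoothed corpus-level BLEU with a single reference per response. It is an illustrative assumption, not the code used by either author: toolkits differ in smoothing, tokenization, and whether scores are computed per corpus or averaged per sentence, and any of these can change the result.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(references, hypotheses, max_n=4):
    """Unsmoothed corpus BLEU-max_n with uniform weights and one reference per hypothesis."""
    matched = [0] * max_n
    total = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            hyp_ngrams = ngram_counts(hyp, n)
            ref_ngrams = ngram_counts(ref, n)
            # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
            matched[n - 1] += sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
            total[n - 1] += sum(hyp_ngrams.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)

# Toy example (hypothetical sentences).
ref = "the cat sat on the mat".split()
hyp = "the cat sat on a mat".split()
bleu2 = corpus_bleu([ref], [hyp], max_n=2)
bleu4 = corpus_bleu([ref], [hyp], max_n=4)
```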

But OpenNMT-py performs much worse than my implementation on the other datasets, anyway.

Sorry, I made a mistake. The results above are from OpenNMT-py's Seq2Seq model. The transformer is much worse.

marvinzh commented 4 years ago


Sorry, I don't agree with you on this point. I also tried seq2seq with attention and the transformer, and both results are much better than yours: the BLEU-4 score is higher than 0.2 and dist-2 is higher than 0.3.

gmftbyGMFTBY commented 4 years ago

@marvinzh Hi, here is the command I use to train the transformer with OpenNMT-py (PyTorch version), borrowed from the OpenNMT-py FAQ:

CUDA_VISIBLE_DEVICES="7" python train.py -data ./data/ubuntu_tf/demo \
    -save_model ./data/ubuntu_tf/demo-model \
    -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8  \
    -encoder_type transformer -decoder_type transformer -position_encoding \
    -train_steps 200000  -max_generator_batches 2 -dropout 0.1 \
    -batch_size 4096 -batch_type tokens -normalization tokens  -accum_count 2 \
    -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
    -max_grad_norm 0 -param_init 0  -param_init_glorot \
    -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 \
    -world_size 1 -gpu_ranks 0
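
To show how the warmup flags above shape the learning rate, here is a small sketch of the Noam schedule from "Attention Is All You Need". It assumes that OpenNMT-py's `noam` option follows that formula with `-learning_rate` acting as the multiplicative factor; the function name `noam_lr` is hypothetical, and the defaults mirror the values in the command (512, 8000, 2).

```python
def noam_lr(step, d_model=512, warmup_steps=8000, factor=2.0):
    """Learning rate under the Noam schedule:
    factor * d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises linearly during warmup, peaks at warmup_steps, then decays as 1/sqrt(step).
peak = noam_lr(8000)
early = noam_lr(4000)
late = noam_lr(200000)
```

Under these assumptions the peak rate is a little under 1e-3, so a nominal `-learning_rate 2` is far from the step size actually applied.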

Training Seq2Seq-attn:

CUDA_VISIBLE_DEVICES="1,2,3,4" python train.py \
    -data data/xdialogue_twitter/demo \
    -save_model data/xdialogue_twitter/demo-model \
    -world_size 4 \
    -gpu_ranks 0 1 2 3 \
    -valid_batch_size 16 \
    -master_port 10001

translate script:

python translate.py \
    -model data/$1_tf/demo-model_step_200000.pt \
    -src data/$1_tf/src-test.txt \
    -output data/$1_tf/pred.txt \
    -replace_unk -verbose --n_best 1 --log_file data/$1_tf/log.txt --gpu 0

I have already finished training the transformers on these five datasets (DailyDialog, DSTC7-AVSD, PersonaChat, EmpChat, Ubuntu). The results are as follows:

| Dataset | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-2 | Dist-1 | Dist-2 | GroundTruth Dist-1 | GroundTruth Dist-2 |
|---|---|---|---|---|---|---|---|---|---|
| DailyDialog | 0.2258 | 0.1505 | 0.1281 | 0.1183 | 0.1246 | 0.0583 | 0.2479 | 0.0723 | 0.3806 |
| DSTC7-AVSD | 0.2246 | 0.1048 | 0.0578 | 0.0344 | 0.057 | 0.0419 | 0.1488 | 0.0803 | 0.3809 |
| PersonaChat | 0.1789 | 0.0652 | 0.0273 | 0.0123 | 0.029 | 0.0135 | 0.0467 | 0.0348 | 0.2628 |
| EmpChat | 0.0984 | 0.0243 | 0.0084 | 0.0038 | 0.0088 | 0.0946 | 0.3572 | 0.0971 | 0.4696 |
| Ubuntu | 0.1154 | 0.0188 | 0.0073 | 0.0041 | 0.0046 | 0.1996 | 0.5487 | 0.2203 | 0.6935 |

As you can see, the transformer's BLEU scores are really bad on the other datasets (although BLEU may not be suitable for measuring the performance), and such low BLEU scores indicate bad results. I also checked the generated responses, and they look terrible (by terrible, I mean really terrible, especially on PersonaChat. I'm at my wits' end; can you tell me how to obtain good results? I really need these results for my paper). For example, the BLEU-4 score of the HRAN model on PersonaChat in this repo is 0.0303, which is better than 0.0123.

Did you also use OpenNMT-py and try these datasets? I really need good transformer performance. Can you tell me the parameters you used?

Right now, I'm trying to use ParlAI to obtain results; they will be uploaded to this issue soon.

gmftbyGMFTBY commented 4 years ago

Or did you not use the OpenNMT transformer? How can I reproduce the better transformer performance? It's very difficult for me ......

marvinzh commented 4 years ago

Actually, I tried OpenNMT's transformer implementation at first in order to get comparable results but, same as you, found that the model performs much worse, so I didn't continue with it. FYI, my transformer model got the scores below on DailyDialog without careful hyperparameter tuning. I recommend training the model sufficiently, for example 200+ epochs on DailyDialog, to see if the performance improves.

bleu-4:  0.2158586328135662
distinct-1:  0.06936866718628215
distinct-2:  0.3212877016462363
average:  0.5071457361432529
greedy:  0.4351326900878399
extrema:  0.41749601489836324

gmftbyGMFTBY commented 4 years ago

Hi, thank you for your prompt response. Do you mean that you used your trs repo to obtain these transformer results?

gmftbyGMFTBY commented 4 years ago

Hi, if the results were obtained with your trs repo, I really want to give it a try. Is it easy to process other datasets (your repo seems to be built for the translation task)?

I noticed you use the Weibo dataset in trs.git. So is it easy and straightforward for me to change the corpus?

Oh, if so, can you provide the format of the Weibo dataset for your trs.git repo? I'd really appreciate it.

Hope to get your response.

marvinzh commented 4 years ago

Hi,

> Hi, thank you for your prompt response. Do you mean that you used your trs repo to obtain these transformer results?

Yes.

If there is no hurry, you can use the software that I will release at the start of March, which is fully dialog-system oriented and makes it easy to try new data/models. The repo I open-sourced yesterday was originally used for testing my transformer implementation and was not expected to be open-sourced, so there is no detailed documentation for it.

If you need a quick experiment, just follow the IWSLT14 example (where the data format is identical to the format used in OpenNMT, so I think you can just reuse the files you prepared before) and replace the filenames with yours; then I think it will be fine to run quick experiments. https://github.com/marvinzh/trs/tree/master/data https://github.com/marvinzh/trs/blob/515d910453de56b7a4e6b23d0b8f24f8ed1a3c1f/data_utils.py#L40

gmftbyGMFTBY commented 4 years ago

Hi, I ran 200 epochs for HRED on the DailyDialog dataset last night, and the results do not seem to have changed. Thank you, I will use your model to do the experiments.

marvinzh commented 4 years ago

Hi, I would like to ask what the data structure of the training data is in your HRED implementation.

gmftbyGMFTBY commented 4 years ago

Hi, the raw data contains 6 files:

src-train.txt
tgt-train.txt
src-test.txt
tgt-test.txt
src-dev.txt
tgt-dev.txt

One sentence per line: the src files contain the dialogue context (query), and the tgt files contain the response to that context.
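
As a rough illustration of this layout, the sketch below reads one split into aligned (context, response) pairs and fails loudly if the two files are out of sync. The helper name `load_pairs` and the directory argument are assumptions; only the six file names come from the list above, and how multiple turns are separated inside a src line is not specified here.

```python
from pathlib import Path

def load_pairs(data_dir, split):
    """Read src-<split>.txt and tgt-<split>.txt as aligned (context, response) pairs."""
    data_dir = Path(data_dir)
    src_lines = (data_dir / f"src-{split}.txt").read_text(encoding="utf-8").splitlines()
    tgt_lines = (data_dir / f"tgt-{split}.txt").read_text(encoding="utf-8").splitlines()
    if len(src_lines) != len(tgt_lines):
        # Line i of the tgt file must answer line i of the src file.
        raise ValueError(f"{split}: {len(src_lines)} contexts vs {len(tgt_lines)} responses")
    return list(zip(src_lines, tgt_lines))
```

For example, `load_pairs("data/dailydialog", "train")` would return the training pairs, where the directory path is hypothetical.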

Hi, have you tried your model on other datasets? Would you like me to provide my processed datasets (DailyDialog, Ubuntu, DSTC7, PersonaChat, EmpChat), so that we can draw conclusions on the same benchmark? Maybe I can send these datasets to you by email?

gmftbyGMFTBY commented 4 years ago

@marvinzh Hi, I ran your transformer model, but it seems to raise some exceptions. I think it is caused by a wrong PyTorch version; can you share the version of your PyTorch package? Thank you so much.

Error: cuda runtime error (710).

Oh, the length of the utterance must be limited to 512 tokens. I got it. Thank you so much for your wonderful work!
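
A minimal sketch of enforcing that limit before batching is shown below. The 512-token bound comes from the comment above; the helper name `truncate_tokens` and the choice to keep the tail (the most recent turns of a multi-turn context) are assumptions, not something the trs repo is known to do.

```python
def truncate_tokens(tokens, max_len=512, keep="tail"):
    """Cut a token list to at most max_len tokens.

    keep="tail" keeps the most recent tokens (the latest dialogue turns);
    keep="head" keeps the earliest ones.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if len(tokens) <= max_len:
        return list(tokens)
    return list(tokens[-max_len:]) if keep == "tail" else list(tokens[:max_len])

# Hypothetical over-long context of 600 token ids.
context = list(range(600))
recent = truncate_tokens(context)
```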

marvinzh commented 4 years ago

> Would you like me to provide my processed datasets (DailyDialog, Ubuntu, DSTC7, PersonaChat, EmpChat), so that we can draw conclusions on the same benchmark? Maybe I can send these datasets to you by email?

Hi, sorry for the late response. I'm going to run my experiments on ubuntu-v2 too, but I haven't started yet. I didn't plan to run experiments on PersonaChat and EmpChat, but if you send me these two datasets, or download links, why not give it a try? It would be good if we could exchange baseline experiment results, which will help us understand our experiments better.

BTW, is everything OK with running my transformer baseline? If you run into any problems, please let me know.

marvinzh commented 4 years ago

Hi, would you mind joining my Slack chat room so that we can exchange our progress smoothly? If that's OK, I will send you the invitation link @gmftbyGMFTBY

gmftbyGMFTBY commented 4 years ago

Sorry for the late response, I've had lots of stuff to do. Thank you so much. I'm running your transformer model on the DailyDialog corpus; so far, it works fine.

Yeah, I can show you the link to download these two datasets:

PersonaChat (Already processed): You can find the link to download processed PersonaChat, DSTC7-AVSD, Dailydialog in this page
EmpChat: wget it

The preprocess scripts can be found in the data/data_process folder in this repo.

Hi, I have never used Slack before, but I signed up just now. My account: GMFTBY, gmftby.slack.com.

Hope you have a good day!

marvinzh commented 4 years ago

Thank you for sharing the data. Here is the invitation link; you can log in to my chatbot channel ~~https://slack.com/share/IUB1B29CY/wFoCNfoKQRtK1Bc4hKtKjtXB/enQtOTYzMDQ1MDc3NDQwLTYxYmNiYTlkN2I2NTc3MTE1ZmJlNzUwNTg2OTU5MTg4ZWMyMmNlZGI5OTcwNTRjYmJlM2ZiYzkwYTllO~~ Sorry, I made a mistake, here is the right link https://join.slack.com/t/chatbot-discussionhq/shared_invi

gmftbyGMFTBY commented 4 years ago

Okay, thanks. I have confirmed the link.

gmftbyGMFTBY / MultiTurnDialogZoo

baseline models #1