Closed marvinzh closed 4 years ago
Hi, thank you for your attention. 1) The baselines that are ready to use: HRED, VHRED, ReCoSa (MReCoSa), HRAN, WSeq, DSHRED, Seq2Seq-attn. The transformer models are still in development. If you can provide a Transformer implementation (PyTorch version, performing at least better than Seq2Seq-attn), I would really appreciate it.
2) About the performance: I'm running extensive experiments on these models and will report the performance soon. So far I only have partial observations on the DailyDialog dataset, and I will push these experimental results into this repo within a month. (I'm sorry, GitHub does not seem to let me upload the .png files. Maybe I will show them in this issue soon.)
3) Thank you for your help: I'm still a newcomer to NLP for dialogue systems, and I'm still struggling to finish the transformer-based baselines such as GPT2 or Transformer (Seq2Seq). If you're familiar with the transformer (PyTorch version) and can provide the Transformer code, I will be very thankful.
4) The code structure is unsatisfactory. I will clean it up in about a month.
Hope you have a good day.
Hi, thank you for your prompt and informative reply.
> The baselines which are ready to use: HRED, VHRED, ReCoSa (MReCoSa), HRAN, WSeq, DSHRED, Seq2Seq-attn.

That's great! I'll try them.
> I'm still struggling to finish the transformer-based baselines such as GPT2 or Transformer(Seq2Seq).

For the past few months I have been working on applying transformer-based models to multi-turn dialog modeling, and I am also implementing a toolkit for quick prototyping of multi-turn dialog models, mainly focused on transformer-based models.
My current transformer implementation (in PyTorch) gets a decent score (34.4 BLEU, small setting) on IWSLT14-de-en compared with other implementations available on GitHub. I'll open-source it later, and it would be nice if it could be helpful to you. :)
Thank you.
Amazing, I cannot wait to learn from your code. Thank you so much!
Hi, did you apply your transformer model to a dialogue corpus to measure the performance? And when will you release the code? I'm very excited about it :)
Hope to get your response.
Hi, sorry for the late response.
Yesterday I made a mistake: the BLEU score obtained from my implementation is not 34.4, it's 33.4. I mixed it up with the score reported in other papers. Anyway, I have open-sourced my transformer implementation and data at
https://github.com/marvinzh/trs.git
AFAIK, it's quite tricky to train the transformer model. There are many factors (e.g. clip_norm, optimizer, lr, etc.) that can affect the performance even when the implementation is right. In my experiments, I tested my model on a single RTX2070 + cu101.
Also, I'd like to ask whether you have evaluation scores on the dailydialog dataset for the baseline models supported in your software. It would be great if you could share them with me! :)
Hi, first of all, thank you so much for your open source codes. I think I will learn a lot from it.
Yes, I can share the results with you (The results are in TensorBoard, but the .png file cannot be uploaded in GitHub, so I share the partial results with you). For all the datasets (Dailydialog, EmpChat, PersonaChat, DSTC7-AVSD, Ubuntu), I use these metrics for evaluating (Human evaluation is hard to obtain, so I don't use it.)
The results of the baselines on the Dailydialog dataset are shown as follows:
Models | PPL | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-2 | Dist-1 | Dist-2 | EA | VX |
---|---|---|---|---|---|---|---|---|---|---|
Seq2Seq | 28.69 | 0.2178 | 0.1015 | 0.0591 | 0.0382 | 0.0557 | 0.0244 | 0.1204 | 0.8738 | 0.8407 |
HRED | 31.49 | 0.1971 | 0.0863 | 0.0458 | 0.0261 | 0.0443 | 0.0147 | 0.0711 | 0.868 | 0.833 |
WSeq | 36.05 | 0.2006 | 0.0843 | 0.0431 | 0.0238 | 0.0385 | 0.0114 | 0.0710 | 0.8675 | 0.8333 |
DSHRED | 37.59 | 0.2054 | 0.0922 | 0.0504 | 0.0301 | 0.0489 | 0.0185 | 0.0885 | 0.08705 | 0.836 |
VHRED | 32.63 | 0.1828 | 0.0776 | 0.0401 | 0.0226 | 0.0387 | 0.0166 | 0.0698 | 0.8631 | 0.8257 |
HRAN | 28.69 | 0.2263 | 0.1109 | 0.0678 | 0.0461 | 0.0631 | 0.0267 | 0.1371 | 0.8741 | 0.8407 |
ReCoSa | 34.46 | 0.1911 | 0.0832 | 0.044 | 0.0251 | 0.0424 | 0.0124 | 0.0580 | 0.8649 | 0.8293 |
The responses generated by the models look good to me.
Yes, the transformer model is very hard to train, which is its fatal weakness. I also tried the implementation in OpenNMT-py to train the multi-turn dialogue systems. I'm glad to see that OpenNMT-py's transformer model is better than my implemented models only on the Dailydialog dataset and much worse than mine on the other four datasets. So I think the implementation in this repo is just fine.
@marvinzh, Hi, can you try the Dailydialog dataset with your trs repo and share the results with me?
I really hate transformers right now, and I have already written a new baseline that has a transformer encoder and a GRU decoder (the transformer decoder does not work in my implementation).
Maybe we can share the experimental results with each other?
Thank you for sharing the result.
Could you also share the ground truth / generated response pairs of the models shown in the above table so that I can test them in my environment? (Any format is OK, I can process it later.) I ask because I found that some of the scores I obtained are quite different from yours. As for the embedding-based metrics, I noticed you use pre-trained GloVe embeddings while I'm using Google's word2vec, so I guess that might affect the result a little bit?
> Maybe we can share the experimental results with each other?

Sure! That would be great!
Hi, I tried to upload the generated file but GitHub always raises an error (Something went really wrong, and we can't process that file), the same as with the .png file. (I don't get it.)
Maybe you can give me your email address and I can send it to you.
Sure, my mail is baiyuu.cs[AT]gmail.com, feel free to contact me!
The mail has been sent. If you have any questions, feel free to contact me. By the way, do you think that the transformer encoder and GRU decoder will work together?
Hi, Thank you for sharing the file!
> By the way, do you think that the transformer encoder and GRU decoder will work together?

Well, it's hard to say. In your case, the GRU decoder needs to generate each word conditioned on the output of the transformer encoder, while the output of the encoder is actually contextual word embeddings. I'm sure you can eventually make it work, but I have doubts about the performance.
As for the file you sent me before, is that the result for the HRED model?
I tested it using BLEU-4, dist-n and embedding-based similarity. Here are the scores in my environment. I found they generally align with yours except for the embedding-based ones (that's totally fine):
bleu: 0.026060156595983073
distinct-1: 0.014626877377357751
distinct-2: 0.07118675084637203
average: 0.40945895230457285
greedy: 0.3217714133217234
extrema: 0.29181506281867986
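For readers unfamiliar with the `average` and `extrema` scores listed above, here is a minimal sketch of how such embedding-based similarities are commonly computed. It is not taken from either repository: the toy two-dimensional embedding table and the helper names are assumptions made for illustration, and `greedy` matching is left out for brevity. Swapping the embedding table (for example GloVe versus word2vec) changes the resulting numbers, which is consistent with the mismatch discussed in this thread.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lookup(tokens: Sequence[str], emb: Dict[str, Vector]) -> List[Vector]:
    """Keep only tokens that have an embedding (out-of-vocabulary words are skipped)."""
    return [emb[t] for t in tokens if t in emb]


def average_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of the word vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def extrema_vector(vectors: List[Vector]) -> Vector:
    """Per dimension, keep the value with the largest absolute magnitude."""
    dim = len(vectors[0])
    return [max((v[d] for v in vectors), key=abs) for d in range(dim)]


def embedding_scores(reference: Sequence[str], hypothesis: Sequence[str],
                     emb: Dict[str, Vector]) -> Dict[str, float]:
    ref, hyp = lookup(reference, emb), lookup(hypothesis, emb)
    if not ref or not hyp:  # nothing to compare
        return {"average": 0.0, "extrema": 0.0}
    return {
        "average": cosine(average_vector(ref), average_vector(hyp)),
        "extrema": cosine(extrema_vector(ref), extrema_vector(hyp)),
    }


# Toy embedding table (hypothetical values, for illustration only).
toy_emb = {"good": [1.0, 0.0], "morning": [0.0, 1.0], "hello": [1.0, 1.0]}
scores = embedding_scores("good morning".split(), "hello".split(), toy_emb)
print(scores)
```

A corpus-level score would then be the mean of these per-pair values over all reference/response pairs.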
Also, I would like to confirm what data is contained in the file you sent me before: is it the test set of dailydialog?
FYI, today I also tried another open source implementation of HRED, which gives 0.074099 and 0.333950 on dist-1 and dist-2, respectively. It's a little bit weird that all of your models have low dist-n scores.
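Since the dist-n numbers are the main point of disagreement here, a minimal sketch of one common definition may help when comparing implementations. It assumes a corpus-level variant (unique n-grams divided by total n-grams, pooled over all responses) and whitespace tokenization; the thread does not say which variant either code base uses, and a per-response average or a different tokenizer would give different values.

```python
from collections import Counter
from typing import Iterable, List


def distinct_n(responses: Iterable[List[str]], n: int) -> float:
    """Corpus-level distinct-n: unique n-grams / total n-grams over all responses."""
    ngrams = Counter()
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0


# Two toy responses; real evaluation would load every generated response.
responses = [
    "i do not know".split(),
    "i do not think so".split(),
]
print("distinct-1:", distinct_n(responses, 1))
print("distinct-2:", distinct_n(responses, 2))
```

Running the same function over the ground-truth responses gives a reference point for how diverse the test set itself is.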
Hi, do you mind if we sync our code for evaluating the generated responses? It would make our results directly comparable. If so, I can create a repo so that we can share and sync our implementations of the different metrics.
> Also, I would like to confirm what data contained by the file you sent me before, is it test set of dailydialog ? FYI, today I also tried another open source implementation of HRED where it gives 0.074099 and 0.333950 on dist-1 and dist-2, respectively. It's little bit weird that all of your models have low dist-n score.

Yes, it's the test set of dailydialog. I also calculated the ground-truth dist-1 and dist-2; the results are 0.0577 and 0.3594. So the high distinct scores also confuse me.
> hi, do you mind if we sync our code for evaluating the generated responses? it will makes our result directly comparable.

Hi, my evaluation metrics are saved in the folder metric, and eval.py shows how to use them. By the way, I will also try the Google w2v.
I forgot something: I tried OpenNMT-py's transformer on Dailydialog, and the results are as follows:
- BLEU(1/2/3/4): 0.2258/0.1505/0.1281/0.1183
- ROUGE: 0.1246
- Dist-1: 0.0583
- Dist-2: 0.2479

But OpenNMT-py performs much worse than my implementation on the other datasets.
Sorry, I made a mistake. The results above are from the Seq2Seq model of OpenNMT-py. The transformer is much worse.
Sorry, I don't agree with you on this point. I also tried seq2seq with attention and the transformer, and both results are much better than yours: the bleu-4 score is higher than 0.2 and dist-2 is higher than 0.3.
@marvinzh Hi, here is the command I use for training the transformer with OpenNMT-py (PyTorch version), borrowed from the OpenNMT-py FAQ:
CUDA_VISIBLE_DEVICES="7" python train.py -data ./data/ubuntu_tf/demo \
-save_model ./data/ubuntu_tf/demo-model \
-layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
-encoder_type transformer -decoder_type transformer -position_encoding \
-train_steps 200000 -max_generator_batches 2 -dropout 0.1 \
-batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 \
-optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
-max_grad_norm 0 -param_init 0 -param_init_glorot \
-label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 \
-world_size 1 -gpu_ranks 0
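To make the optimizer flags in the command above easier to reason about, here is a small sketch of the Noam learning-rate schedule from the original Transformer paper, evaluated with the values from that command (`-rnn_size 512`, `-warmup_steps 8000`, `-learning_rate 2`). It assumes that the `-learning_rate` value acts as a multiplicative factor on the schedule; the function name `noam_lr` is made up for this illustration and is not part of OpenNMT-py.

```python
def noam_lr(step: int, model_size: int = 512, warmup_steps: int = 8000,
            factor: float = 2.0) -> float:
    """Learning rate at a given optimizer step under the Noam schedule.

    Linear warm-up for `warmup_steps` steps, then decay proportional to
    1 / sqrt(step). `factor` is assumed to be a plain multiplier.
    """
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return factor * model_size ** -0.5 * min(step ** -0.5,
                                             step * warmup_steps ** -1.5)


for step in (100, 4000, 8000, 16000, 200000):
    print(f"step {step:>6}: lr = {noam_lr(step):.6f}")
```

The peak is reached exactly at the end of warm-up, so a small dataset that only sees a few thousand updates may spend most of its training in the warm-up phase. That is one hyperparameter interaction worth checking before concluding that the model itself is at fault.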
Train the Seq2Seq-attn
CUDA_VISIBLE_DEVICES="1,2,3,4" python train.py \
-data data/xdialogue_twitter/demo \
-save_model data/xdialogue_twitter/demo-model \
-world_size 4 \
-gpu_ranks 0 1 2 3 \
-valid_batch_size 16 \
-master_port 10001
Translate script:
python translate.py \
-model data/$1_tf/demo-model_step_200000.pt \
-src data/$1_tf/src-test.txt \
-output data/$1_tf/pred.txt \
-replace_unk -verbose --n_best 1 --log_file data/$1_tf/log.txt --gpu 0
I have already finished training transformers on these five datasets (Dailydialog, DSTC7-AVSD, PersonaChat, EmpChat, Ubuntu). The results are as follows:
Dataset | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE-2 | Dist-1 | Dist-2 | GroundTruth Dist-1 | GroundTruth Dist-2 |
---|---|---|---|---|---|---|---|---|---|
DailyDialog | 0.2258 | 0.1505 | 0.1281 | 0.1183 | 0.1246 | 0.0583 | 0.2479 | 0.0723 | 0.3806 |
DSTC7-AVSD | 0.2246 | 0.1048 | 0.0578 | 0.0344 | 0.057 | 0.0419 | 0.1488 | 0.0803 | 0.3809 |
PersonaChat | 0.1789 | 0.0652 | 0.0273 | 0.0123 | 0.029 | 0.0135 | 0.0467 | 0.0348 | 0.2628 |
EmpChat | 0.0984 | 0.0243 | 0.0084 | 0.0038 | 0.0088 | 0.0946 | 0.3572 | 0.0971 | 0.4696 |
Ubuntu | 0.1154 | 0.0188 | 0.0073 | 0.0041 | 0.0046 | 0.1996 | 0.5487 | 0.2203 | 0.6935 |
As you can see, the BLEU performance of the transformer is really bad on the other datasets. BLEU may not be well suited to measuring dialogue quality, but such low BLEU scores still indicate bad results. I also checked the generated responses and they look terrible (and by terrible I mean really terrible, especially on PersonaChat. I'm really at a loss. Can you tell me how to obtain good results? I really need these results for my paper.). For example, the BLEU-4 score of the HRAN model on PersonaChat in this repo is 0.0303, which is better than 0.0123.
Did you also use OpenNMT-py and try these datasets? I really need good performance from the transformer. Can you tell me the parameters you used?
Right now, I'm trying to use ParlAI to obtain the results, which I will upload to this issue soon.
Or didn't you use the OpenNMT transformer? How can I reproduce the better transformer performance? It is very difficult for me ......
Actually, I tried OpenNMT's transformer implementation at first in order to get comparable results but, same as you, the model performed much worse, so I didn't continue with it.
FYI, my transformer model got the scores below on dailydialog without careful tuning of the hyperparameters.
I recommend training the model sufficiently, for example 200+ epochs on dailydialog, to see if the performance improves.
bleu-4: 0.2158586328135662
distinct-1: 0.06936866718628215
distinct-2: 0.3212877016462363
average: 0.5071457361432529
greedy: 0.4351326900878399
extrema: 0.41749601489836324
Hi, thank you for your prompt response. You mean that you used your trs repo to obtain these transformer results?
Hi, if the results were obtained with your trs repo, I really want to give it a try. Is it easy to process other datasets (your repo seems to be built for the translation task)?
I notice you use the weibo dataset in trs.git. So is it easy and straightforward for me to change the corpus?
Oh, if so, can you provide the format of the weibo dataset for your trs.git repo? I'd really appreciate it.
Hope to get your response.
Hi,
> Hi thank you for your prompt response. You mean that you use your trs repo to obtain this results of transformer?

Yes.
If there is no hurry, you can use the software that I will release at the start of March, which is fully dialog-system oriented and makes it easy to try new data/models. The repo I open-sourced yesterday was originally used for testing my transformer implementation and was not expected to be open-sourced, so there is no detailed documentation for it.
If you need a quick experiment, just follow the IWSLT14 example (where the data format is identical to the format used in OpenNMT, so I think you can just reuse the files you prepared before) and replace the filenames with yours; then I think it will be fine to run quick experiments. https://github.com/marvinzh/trs/tree/master/data https://github.com/marvinzh/trs/blob/515d910453de56b7a4e6b23d0b8f24f8ed1a3c1f/data_utils.py#L40
Hi, I ran 200 epochs for HRED on the Dailydialog dataset last night, and the results do not seem to change. Thank you, I will use your model to do the experiments.
Hi, I would like to ask what the data structure of the training data is in your HRED implementation?
Hi, the raw data contains 6 files:
One sentence per line: the src file contains the dialogue query (context), and the tgt file contains the response to the context.
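The line-aligned layout described above can be sketched as follows. This is only an illustration of the format, not code from the repository: `read_parallel` is a hypothetical helper, `tgt-test.txt` is an assumed file name (only `src-test.txt` appears in this thread), and how multiple context turns are joined within one src line is not specified here, so the example uses single-turn contexts.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def read_parallel(src_path: Path, tgt_path: Path) -> List[Tuple[str, str]]:
    """Pair line i of the src file (context) with line i of the tgt file (response)."""
    src_lines = src_path.read_text(encoding="utf-8").splitlines()
    tgt_lines = tgt_path.read_text(encoding="utf-8").splitlines()
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"{len(src_lines)} contexts but {len(tgt_lines)} responses"
        )
    return list(zip(src_lines, tgt_lines))


# Build two tiny files in a temporary directory to show the format.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src-test.txt"
    tgt = Path(tmp) / "tgt-test.txt"  # assumed name, for illustration
    src.write_text("how are you ?\nwhat time is it ?\n", encoding="utf-8")
    tgt.write_text("i am fine .\nit is noon .\n", encoding="utf-8")
    pairs = read_parallel(src, tgt)

print(pairs[0])
```

Checking that both files have the same number of lines up front catches misaligned data, which would otherwise silently pair contexts with the wrong responses.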
Hi, have you tried your model on other datasets? Do you need me to provide my processed datasets (Dailydialog, Ubuntu, DSTC7, PersonaChat, EmpChat) so that we can draw conclusions on the same benchmark? Maybe I can send these datasets to you by email?
@marvinzh Hi, I ran your transformer model, but it raises some exceptions. I think they are caused by the wrong pytorch version. Can you share the version of your pytorch package? Thank you so much.
Error: cuda runtime error (710).
Oh, the length of the utterance must be limited to 512 tokens. I got it. Thank you so much for your wonderful work!
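One simple way to respect such a 512-token limit during preprocessing is to clip long contexts before they reach the model. The sketch below keeps the most recent tokens, on the assumption that the latest turns matter most for the response; the thread does not say how the limit should be enforced, and `truncate_context` is a hypothetical helper.

```python
from typing import List


def truncate_context(tokens: List[str], max_len: int = 512) -> List[str]:
    """Keep at most the last `max_len` tokens of a tokenized context."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return tokens[-max_len:]


# A synthetic 600-token context for demonstration.
context = [f"tok{i}" for i in range(600)]
clipped = truncate_context(context)
print(len(clipped), clipped[0], clipped[-1])
```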
> Do you need I provide you my processed datasets (Dailydialog, Ubuntu, DSTC7, PersonaChat, EmpChat) for you, so we can make the conclusion on the same benchmark. Maybe I can send these datasets to you by email?

Hi, sorry for the late response.
I'm going to run my experiments on ubuntu-v2 too, but I haven't started yet. I didn't plan to run experiments on PersonaChat and EmpChat, but if you send me these 2 datasets, or download links, why not give it a try?
It would be good if we could exchange the baseline experiment results, which will help us understand our experiments better.
BTW, is everything OK when running my transformer baseline? If you run into any problems please let me know.
Hi, would you mind joining my Slack chat room so that we can exchange our progress smoothly? If that's OK, I will send you the invitation link @gmftbyGMFTBY
Sorry for the late response, I've had lots of stuff to do. Thank you so much. I'm running your transformer model on the dailydialog corpus and, so far, it works fine.
Yeah, I can show you the links to download these two datasets:
The preprocessing scripts can be found in the data/data_process folder in this repo.
Hi, I have never used Slack before, but I signed up just now. My account: GMFTBY, gmftby.slack.com.
Hope you have a good day!
Thank you for sharing the data.
Here is the invitation link; you can log in to my chatbot channel:
https://slack.com/share/IUB1B29CY/wFoCNfoKQRtK1Bc4hKtKjtXB/enQtOTYzMDQ1MDc3NDQwLTYxYmNiYTlkN2I2NTc3MTE1ZmJlNzUwNTg2OTU5MTg4ZWMyMmNlZGI5OTcwNTRjYmJlM2ZiYzkwYTllO
Sorry, I made a mistake, here is the right link:
https://join.slack.com/t/chatbot-discussionhq/shared_invi
Okay, thanks. I confirm the link.
Hi, I found this repo is quite helpful for the dialog system research community. I was wondering are these baseline models provided in this repo ready to use or still in the development? Thank you!