JohannesTK opened this issue 6 years ago
You can obtain the Twitter Dialog Corpus by emailing its author. I did some preprocessing on it and stored the result in the dialog.txt file.
The format of the dialog.txt file is one dialog per line: q1\ta1\tq2\ta2\n
Each line contains several utterances (q1, a1, and so on), which you can split on '\t'. Each utterance is already word-tokenized, so you can split it on whitespace to get the word tokens.
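To make the format concrete, here is a minimal sketch of a loader for dialog.txt as described above (the helper name is mine, not part of the repo):

```python
def load_dialogs(path):
    """Return a list of dialogs; each dialog is a list of token lists.

    Assumes the dialog.txt format described above: one dialog per line,
    utterances separated by tabs, tokens separated by whitespace.
    """
    dialogs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            utterances = line.split("\t")  # q1, a1, q2, a2, ...
            dialogs.append([u.split() for u in utterances])
    return dialogs
```

Each element of the returned list is one dialog, already tokenized, ready to be converted to id sequences for the .tfrecords files.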
To train the model, first run prepare_context_RG_data.py to generate the .tfrecords files, then run train.py.
Hope this helps.
Thank you for the fast answer.
Regarding rg_vocab.txt: https://github.com/shawnspace/HRAN/blob/master/prepare_context_RG_data.py#L17
Does it contain all the unique words in the data, separated by newlines, with unk at index 0 and eos at index 1? https://github.com/shawnspace/HRAN/blob/master/predict.py#L91
At the same time, eos is commented out in the word-embedding initialization: https://github.com/shawnspace/HRAN/blob/master/helper.py#L43
Anything I am missing?
Also, were you able to reproduce the results given in the paper?
Yes, the rg_vocab.txt file contains all the unique words, separated by newline characters, with unk at index 0 and eos at index 1.
The commented line in https://github.com/shawnspace/HRAN/blob/master/helper.py#L43 is not necessary because both the eos and unk tokens can be initialized with random values, so I deleted it. Thanks for pointing it out.
The paper uses a different dataset so I can't compare my model's performance to it.
I see, thanks.
Do you have any advice regarding a dialog with this format?
q1 - a1 - a2 - q2 - a3
Meaning there are multiple answers to one question.
I think it depends on your task requirements. For example, if you are required to output just one response, you can randomly choose one answer as the training example, because all of them are plausible responses to that specific query.
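The random-selection idea above can be sketched like this (the function and data layout are illustrative, not part of the repo):

```python
import random

def sample_training_pairs(dialog, seed=None):
    """Flatten (query, [answers]) pairs into (query, answer) training pairs.

    When a query has multiple plausible answers, sample one of them at
    random, as suggested above. A seed makes the sampling reproducible.
    """
    rng = random.Random(seed)
    return [(q, rng.choice(answers)) for q, answers in dialog]
```

For a dialog like q1 - a1 - a2 - q2 - a3, you would pass [("q1", ["a1", "a2"]), ("q2", ["a3"])] and get one (query, answer) pair per query.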
I made some changes to the rg_vocab.txt file format. It now contains all the unique words separated by newlines, with unk at index 0, sos at index 1, and eos at index 2.
Please note this, and the corresponding change in prepare_context_RG_data.py, where you need to append a different index to the response id list in the .tfrecords file.
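A minimal sketch of reading the revised vocabulary and encoding a response under the new indices (unk=0, sos=1, eos=2); the helper names are hypothetical, only the file format follows the note above:

```python
def load_vocab(path):
    """Read rg_vocab.txt: one word per line, line number = word id."""
    with open(path, encoding="utf-8") as f:
        words = [line.rstrip("\n") for line in f if line.rstrip("\n")]
    return {w: i for i, w in enumerate(words)}

def encode_response(tokens, word2id, unk_id=0, eos_id=2):
    """Map tokens to ids (unknown words -> unk) and append the eos id,
    matching the response id list stored in the .tfrecords file."""
    return [word2id.get(t, unk_id) for t in tokens] + [eos_id]
```

With the revised format, the first three lines of rg_vocab.txt would be the unk, sos, and eos tokens, so their line positions give them ids 0, 1, and 2.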
Thanks!
It seems the training never starts, as nvidia-smi reports nothing loaded into memory. It's just stuck after "Begin training":
INFO:tensorflow:model dir ./model/persona_chat1
INFO:tensorflow:check point None
INFO:tensorflow:Begin training
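One quick sanity check worth running (a hypothetical helper, not from the repo): a hang right after "Begin training" can be an input-pipeline stall, e.g. the .tfrecords files produced by prepare_context_RG_data.py are missing or empty, so it may help to verify them before suspecting the GPU:

```python
import glob
import os

def check_tfrecords(pattern):
    """Fail fast if no .tfrecords files match the pattern or any is empty."""
    files = glob.glob(pattern)
    if not files:
        raise FileNotFoundError(f"no files match {pattern}")
    empty = [f for f in files if os.path.getsize(f) == 0]
    if empty:
        raise ValueError(f"empty tfrecords files: {empty}")
    return files
```

If the files check out, the next thing to inspect would be the data-loading code itself (e.g. a reader waiting on an exhausted or misconfigured input queue).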
Do you happen to have any inference results with your own dataset?
Currently testing with multiple context architectures and would love to see your HRAN results if you'd care to share.
I am not sure what happened on your machine because there is too little information.
I have some results, but they are on the Facebook Persona-chat dialog corpus. I once compared HRAN with HRED. Surprisingly, HRAN performed worse than HRED. But please note that I didn't carefully tune the hyper-parameters for either model.
I see, thanks for the info.
I see you use Twitter data for training: https://github.com/shawnspace/HRAN/blob/master/prepare_context_RG_data.py#L19
What is its format, and do you have a proper README for training the model?
Thanks.