shawnspace / HRAN

The implementation of Hierarchical Recurrent Attention Network

Input data format? #2

Open JohannesTK opened 6 years ago

JohannesTK commented 6 years ago

I see you use Twitter data for training: https://github.com/shawnspace/HRAN/blob/master/prepare_context_RG_data.py#L19

What's its format, and do you have a proper README for training the model?

Thanks.

shawnspace commented 6 years ago

You can obtain the Twitter Dialog Corpus by emailing its author. I did some preprocessing on it and stored the result in the dialog.txt file.

The format of dialog.txt file is: q1\ta1\tq2\ta2\n q1\ta1\tq2\ta2\n ...

Each line contains several utterances (q1, a1, etc.), which you can split on '\t'. Each utterance has already been word-tokenized, so you can split it on whitespace to get the word tokens.
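A minimal sketch of how you can read it (illustrative only; the file path and variable names are just for the example, not the actual preprocessing code):

# Read dialog.txt: one dialog per line, utterances separated by '\t',
# word tokens within an utterance separated by whitespace.
dialogs = []
with open('dialog.txt', 'r') as f:
    for line in f:
        utterances = line.rstrip('\n').split('\t')    # q1, a1, q2, a2, ...
        tokenized = [u.split() for u in utterances]   # utterances are already tokenized
        dialogs.append(tokenized)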

To train the model, first use prepare_context_RG_data.py to generate the .tfrecords files. Then use train.py to train the model.

Hope this helps you

JohannesTK commented 6 years ago

Thank you for the fast answer.

Regarding rg_vocab.txt: https://github.com/shawnspace/HRAN/blob/master/prepare_context_RG_data.py#L17

Does it contain all the unique words in the data, one per line, with unk at index 0 and eos at index 1? https://github.com/shawnspace/HRAN/blob/master/predict.py#L91

At the same time, eos is commented out in the word embedding loading: https://github.com/shawnspace/HRAN/blob/master/helper.py#L43

Anything I am missing?

Also, were you able to reproduce the results given in the paper?

shawnspace commented 6 years ago

Yes, the rg_vocab.txt file contains all the unique words, one per line. The unk index is 0 and the eos index is 1.

The commented-out line in https://github.com/shawnspace/HRAN/blob/master/helper.py#L43 is not necessary because you can initialize both the eos and unk token embeddings with random values, so I deleted it. Thanks for pointing it out.
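What I mean is roughly the following (an illustrative sketch, not the actual helper.py code; the sizes and names are made up):

import numpy as np

# Start every embedding row with small random values; rows for words that appear
# in the pretrained embedding file can be overwritten later, while special tokens
# such as unk and eos simply keep their random initialization.
vocab_size, emb_dim = 20000, 300
embedding = np.random.uniform(-0.1, 0.1, (vocab_size, emb_dim)).astype(np.float32)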

The paper uses a different dataset, so I can't compare my model's performance with the reported results.

JohannesTK commented 6 years ago

I see, thanks.

Do you have any advice regarding a dialog with the following format?

q1 - a1 - a2 - q2 - a3

Meaning there are multiple answers to one question.

shawnspace commented 6 years ago

I think it depends on your task requirements. For example, if you are only required to output one response, you can randomly choose one of the answers as the training example, because all of them are plausible responses to that specific query.
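For example, something like this (an illustrative sketch; the names and values are made up):

import random

# One query with several plausible answers; pick one at random as the training target.
query = 'q1'
answers = ['a1', 'a2']
training_pair = (query, random.choice(answers))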

shawnspace commented 6 years ago

I made some changes to the rg_vocab.txt file format. It now contains all the unique words, one per line, with unk at index 0, sos at index 1, and eos at index 2.

Please take note of this, and of the corresponding change in prepare_context_RG_data.py, where you now need to append a different index to the response id list stored in the .tfrecords file.
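Loading the vocabulary and building a response id list then looks roughly like this (an illustrative sketch of the convention above, not the exact repo code):

# rg_vocab.txt: one word per line; line 0 is unk, line 1 is sos, line 2 is eos.
with open('rg_vocab.txt', 'r') as f:
    vocab = [w.strip() for w in f]
word2id = {w: i for i, w in enumerate(vocab)}
UNK_ID, SOS_ID, EOS_ID = 0, 1, 2

response = 'hello there'
response_ids = [word2id.get(w, UNK_ID) for w in response.split()] + [EOS_ID]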

JohannesTK commented 6 years ago

Thanks!

JohannesTK commented 6 years ago

It seems the training never starts, as nvidia-smi reports nothing loaded into memory. It's just stuck after "Begin training":

INFO:tensorflow:model dir ./model/persona_chat1
INFO:tensorflow:check point None
INFO:tensorflow:Begin training
JohannesTK commented 6 years ago

Do you happen to have any inference results with your own dataset?

I'm currently testing multiple context architectures and would love to see your HRAN results if you'd care to share.

shawnspace commented 6 years ago

I am not sure what happened on your machine; there is too little information to go on.

I have some results, but they were obtained on the Facebook Persona-Chat dialog corpus. I once compared HRAN with HRED, and surprisingly, HRAN performed worse than HRED. But please note that I didn't carefully tune the hyper-parameters for either HRAN or HRED.

JohannesTK commented 6 years ago

I see, thanks for the info.