I want to try your model on my own customized dataset but I'm overwhelmed by your codebase. So a few questions:
Does the dataset need to include topic words for the training procedure, only for generation, or both?
If I have a customized dataset, where do I start? From my understanding I need to:
A. Create a dialogues file like yours, in your Reddit format.
B. Use the LDA script to generate topic words and attach them to the dataset?
C. Create a vocabulary file somehow?
D. Then train?
Am I missing something?
Can you please let me know which file/script I need to use for each stage?
For THRED and Topic-Aware seq2seq, you need topic words for training and generation. However, for HRED and vanilla seq2seq, you don't need topic words at all.
To be able to use your own data, just make sure the data format matches the description.
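For illustration, here is a minimal sketch of what preparing such a file might look like. It assumes each line is one dialogue with utterances separated by TAB characters; the filename, the example utterances, and the TAB convention itself are assumptions for illustration, so confirm them against the repository's data-format description before training:

```python
# Hypothetical sketch: write a dialogue file where each line is one
# conversation and utterances are TAB-separated. The TAB separator is an
# assumption; check it against the repo's documented data format.
dialogues = [
    ["how do i train on my own data", "match the documented format first"],
    ["does hred need topic words", "no, only thred and ta-seq2seq do"],
]

with open("my_dialogues.txt", "w", encoding="utf-8") as f:
    for utterances in dialogues:
        f.write("\t".join(utterances) + "\n")

# Sanity check: every line should contain at least two utterances.
with open("my_dialogues.txt", encoding="utf-8") as f:
    for line_no, line in enumerate(f, 1):
        turns = line.rstrip("\n").split("\t")
        assert len(turns) >= 2, f"line {line_no} has fewer than 2 utterances"
```

The assertion pass at the end is just a cheap way to catch malformed lines before they reach the training pipeline.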
You can use the provided pre-trained LDA model (here) to infer topic words for your own data, as follows:
```
python thred/topic_model/lda.py --mode infer --dialogue_as_doc \
    --model_dir <PATH TO THE DOWNLOADED MODEL> --test_data <PATH TO YOUR DATA FILE>
```
The vocabulary file will be automatically created during training, so no need to worry about it.
You are all set to train the model! See here for more information.