guxd / DialogWAE

Source Code for DialogWAE: Multimodal Response Generation with Conditional Wasserstein Autoencoder (https://arxiv.org/abs/1805.12352)

Code explanation about data preprocessing #11

Open sinlin0908 opened 4 years ago

sinlin0908 commented 4 years ago

Hello, thanks for open-sourcing your code. I am trying to understand it, but the data preprocessing in data.py confuses me.

When building the vocabulary, the code prints:

print("Load corpus with train size %d, valid size %d, "
      "test size %d raw vocab size %d vocab size %d at cut_off %d OOV rate %f"
      % (len(self.train_corpus), len(self.valid_corpus), len(self.test_corpus),
         raw_vocab_size, len(vocab_count), vocab_count[-1][1],
         float(discard_wc) / len(all_words)))

What do the train size, valid size, and test size mean? All three values are 2, because each corpus is a tuple of length 2, so len() returns the tuple length rather than the number of dialogues.

Do you mean that the vocabulary is drawn from the training, validation, and test data together? In the code, only the training data is used to build the vocabulary.
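
To make the quantities in that print statement concrete, here is a minimal sketch of a frequency-based vocabulary built from training utterances only. The variable names raw_vocab_size, vocab_count, discard_wc, and all_words come from the snippet above; the function name build_vocab, the max_vocab_size parameter, the <pad>/<unk> entries, and the toy data are assumptions for illustration and may not match data.py.

```python
from collections import Counter


def build_vocab(train_utts, max_vocab_size):
    """Count words in the training utterances only and keep the most frequent."""
    all_words = [w for utt in train_utts for w in utt]
    # most_common() returns (word, count) pairs sorted by count, descending.
    vocab_count = Counter(all_words).most_common()
    raw_vocab_size = len(vocab_count)
    # Occurrences of words that fall outside the size limit (treated as OOV).
    discard_wc = sum(count for _, count in vocab_count[max_vocab_size:])
    vocab_count = vocab_count[:max_vocab_size]
    cut_off = vocab_count[-1][1]  # count of the rarest word that was kept
    oov_rate = float(discard_wc) / len(all_words)
    # Hypothetical special entries; the real code may use different ones.
    vocab = ["<pad>", "<unk>"] + [w for w, _ in vocab_count]
    return vocab, raw_vocab_size, cut_off, oov_rate


# Toy training data: two tokenized utterances.
train_utts = [["<s>", "hello", "there", "</s>"],
              ["<s>", "hello", "again", "</s>"]]
vocab, raw_size, cut_off, oov_rate = build_vocab(train_utts, max_vocab_size=3)
print(raw_size, len(vocab), cut_off, oov_rate)
```

In this sketch the OOV rate is the share of training tokens whose word was dropped from the vocabulary, so it describes the training data only; validation and test words never enter the counts.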

When formatting a dialogue, is it essential to add [<s>, <d>, </s>] at the start of the dialogue? Can I leave it out?

Thank you.
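
The role of the leading [<s>, <d>, </s>] entry is not stated in this thread. One plausible reading, assumed for the sketch below, is that it serves as a dummy first turn so that the first real utterance still has a non-empty context when context/response pairs are built. The names BOD_UTT, format_dialog, and context_response_pairs are hypothetical and are not taken from data.py.

```python
# Assumed "beginning of dialogue" placeholder turn.
BOD_UTT = ["<s>", "<d>", "</s>"]


def format_dialog(utterances):
    """Wrap each utterance in <s> ... </s> and prepend the placeholder turn."""
    wrapped = [["<s>"] + utt.split() + ["</s>"] for utt in utterances]
    return [BOD_UTT] + wrapped


def context_response_pairs(dialog):
    """Each turn after the first is a response; all earlier turns are its context."""
    return [(dialog[:i], dialog[i]) for i in range(1, len(dialog))]


dialog = format_dialog(["how are you ?", "fine , thanks ."])
pairs = context_response_pairs(dialog)
for context, response in pairs:
    print(len(context), response)
```

Under this assumption, dropping the placeholder would leave the first utterance with an empty context, so it could no longer be used as a response unless the model handles empty contexts some other way.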

Bortrex commented 4 years ago

Hi there, I hope I can help. I'm only using the DailyDialog dataset.

sinlin0908 commented 3 years ago

Thank you for the explanation!