Code explanation about data preprocessing

sinlin0908 commented 4 years ago

Hello, thanks for open-sourcing your code. I am trying to understand it, but the data preprocessing in data.py confuses me.

When building the vocabulary:

print("Load corpus with train size %d, valid size %d, "
              "test size %d raw vocab size %d vocab size %d at cut_off %d OOV rate %f"
              % (len(self.train_corpus), len(self.valid_corpus), len(self.test_corpus),
                 raw_vocab_size, len(vocab_count), vocab_count[-1][1], float(discard_wc) / len(all_words)))

What do the train size, valid size, and test size mean? All three values are 2, since each corpus is a tuple of length 2.

Do you mean that the vocabulary is built from the training, validation, and test data? In the code, however, only the training data is used to build the vocabulary.

When formatting a dialogue, is it essential to add [<s>, <d>, </s>] at the start of the dialogue? Can I leave it out?

Thank you.

Bortrex commented 4 years ago

Hi there, I hope I can help. I'm only using the DailyDialog dataset.

"However, it only uses the training data to build the vocabulary in the code."

The paper mentions the train/valid/test ratio, and if you check "DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset", the vocabulary reported there is larger than the one used here. However, given the relative sizes of the train/valid/test sets, it is fair to assume that the missing tokens are very rare.
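To make the printed numbers concrete, here is a minimal sketch of a training-only vocabulary with a size cut-off. It is not the repository's code: the function `build_vocab`, the `(dialogs, utterances)` tuple layout, the `<pad>`/`<unk>` entries, and the toy data are all assumptions for illustration. It also shows why calling `len()` on a corpus stored as a 2-tuple gives 2, whatever the number of dialogues inside it.

```python
from collections import Counter


def build_vocab(train_utterances, max_vocab_size):
    """Count words in the training utterances only and keep the most frequent ones."""
    all_words = [word for utt in train_utterances for word in utt]
    vocab_count = Counter(all_words).most_common()
    raw_vocab_size = len(vocab_count)            # distinct words before the cut-off
    kept = vocab_count[:max_vocab_size]
    discarded = sum(count for _, count in vocab_count[max_vocab_size:])
    cut_off = kept[-1][1]                        # frequency of the rarest kept word
    oov_rate = discarded / len(all_words)        # share of tokens that fall outside the vocabulary
    vocab = ["<pad>", "<unk>"] + [word for word, _ in kept]  # hypothetical special tokens
    return vocab, raw_vocab_size, cut_off, oov_rate


# Hypothetical toy corpus stored as a tuple (dialogs, utterances).
train_utts = [["hello", "there"], ["hello", "again"], ["good", "morning", "hello"]]
train_corpus = ([train_utts], train_utts)

corpus_len = len(train_corpus)  # length of the tuple, not the number of dialogues
vocab, raw_size, cut_off, oov_rate = build_vocab(train_corpus[1], max_vocab_size=2)
print(corpus_len, raw_size, len(vocab), cut_off, oov_rate)
```

Under these assumptions, words that appear only in the validation or test data are mapped to the unknown token at lookup time, which is why they add nothing to the vocabulary size.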
"Is it essential to add [<s>, <d>, </s>] at the start of the dialogue?"

Yes. It tells the dialogue system to reset and start over, or that the sentence following [<s>, <d>, </s>] belongs to a different topic.
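As a rough sketch of what such a marker does, the snippet below prepends a beginning-of-dialogue utterance to a tokenized dialogue. The helper `format_dialog`, the lowercasing, and the whitespace tokenization are assumptions for illustration, not the repository's code. Only the marker tokens come from the question above.

```python
# Marker utterance taken from the question above; everything else here is hypothetical.
BOD_UTT = ["<s>", "<d>", "</s>"]


def format_dialog(utterances):
    """Prepend the dialogue-start marker and wrap each utterance in <s> ... </s>."""
    formatted = [list(BOD_UTT)]
    for utt in utterances:
        formatted.append(["<s>"] + utt.lower().split() + ["</s>"])
    return formatted


dialog = format_dialog(["How are you ?", "Fine , thanks ."])
for turn in dialog:
    print(turn)
```

With this layout the first real utterance always has a fixed preceding context, so a context encoder sees the same "start of dialogue" input at the beginning of every conversation.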

sinlin0908 commented 3 years ago

Thank you for the explanation!

guxd / DialogWAE

Code explanation about data preprocessing #11