Closed by scottfleming 5 years ago
@scottfleming Sorry for the confusion. The distinction between trainset and testset applies only to the pretraining phase. For end-to-end training, DCC runs on the complete dataset. In a completely unsupervised scenario (which is the case here) there are no separate training and inference phases, so the graph construction should be done on the complete dataset. Please maintain the same ordering of the data throughout, i.e., [trainset, testset]. https://github.com/shahsohil/DCC/blob/63f3851ca970ef656d1a2d58f825454cd3ab7681/pytorch/extract_feature.py#L69
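To make the ordering requirement concrete, here is a minimal sketch of stacking the two splits as [trainset, testset] before graph construction. The helper name `stack_in_order` and the toy arrays are hypothetical and not part of the DCC code; it only assumes each split is a 2-D array of shape (n_samples, n_features).

```python
import numpy as np


def stack_in_order(train, test):
    """Return one array with all train rows first, then all test rows.

    Hypothetical helper, not part of DCC. Row i of the result is the
    node index i that a graph built on the complete dataset would use.
    """
    train = np.asarray(train)
    test = np.asarray(test)
    if train.ndim != 2 or test.ndim != 2:
        raise ValueError("expected 2-D arrays of shape (n_samples, n_features)")
    if train.shape[1] != test.shape[1]:
        raise ValueError("train and test must have the same number of features")
    return np.vstack([train, test])


# Toy stand-ins for the real splits: 5 train rows of zeros, 2 test rows of ones.
train = np.zeros((5, 3))
test = np.ones((2, 3))
full = stack_in_order(train, test)
```

With this layout, rows 0 to 4 of `full` are the train samples and rows 5 to 6 are the test samples, so any later step that indexes by row sees the same ordering.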
Ah I see, this is making more sense. It's unclear to me then when you would actually use the training and testing feature. In your Reuters example, I'm assuming you trained the SDAE on the train set (~8500 articles), used extract_feature.py to get the embeddings for both the train set (~8500 articles) and the test set (~1500 articles), but then ran edgeConstruction.py on just the training set?
Are you suggesting that the split into train and test in this case is just to provide a validation loss for the SDAE training to ensure that we're not overfitting, for example?
Ultimately, I'm just trying to reproduce your published Reuters results in my local environment. There are a few spots, though, where I'm struggling to fill in the gaps. For example, in the edgeConstruction.py file you mention that "PCA is computed for Text dataset. Please refer RCC paper for exact details". But I don't see anything in the RCC paper about PCA preprocessing specific to the Text dataset. Am I missing something obvious?
Thanks for all your help!
Hi @scottfleming,
edgeConstruction.py is applied on the complete dataset. Let me clarify the steps:
Regarding the details of PCA, they are given in the Datasets section of the supplementary material of the RCC paper.
It's unclear to me from the documentation how pretrained.mat is supposed to be generated. pretraining.py takes in data/mydataset/traindata.mat and data/mydataset/testdata.mat and spits out data/mydataset/results/checkpoint_4.pth.tar, such that when extract_feature.py takes in checkpoint_4.pth.tar it spits out a matrix covering all n_train + n_test samples. But are we then supposed to run RCC's edgeConstruction module on traindata.mat, on testdata.mat, or on a combination of the two in order to produce pretrained.mat? If we do it on just one of them and then feed the resulting graph into copyGraph.py, it'll throw a shape mismatch error...
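For anyone hitting the same mismatch, a quick pre-check along these lines can show whether a graph was built on the same number of samples as the extracted features. This is only a sketch: the function name `check_graph_covers_data` is hypothetical, and it assumes the graph is available as 0-based (i, j) node-index pairs, which may not match the format edgeConstruction.py actually writes.

```python
def check_graph_covers_data(edges, n_train, n_test):
    """Check that every edge endpoint is a valid row of the full feature matrix.

    Hypothetical helper. Assumes `edges` is an iterable of 0-based (i, j)
    index pairs and that the feature matrix has n_train + n_test rows.
    Returns the expected number of rows, or raises ValueError.
    """
    n_total = n_train + n_test
    for i, j in edges:
        for node in (i, j):
            if node < 0 or node >= n_total:
                raise ValueError(
                    f"edge ({i}, {j}) refers to node {node}, "
                    f"but only {n_total} samples are expected"
                )
    return n_total


# Toy example: a graph over 7 nodes matches 5 train + 2 test samples.
toy_edges = [(0, 1), (1, 6), (2, 5)]
n_rows = check_graph_covers_data(toy_edges, n_train=5, n_test=2)
```

Note that the check only catches indices that are out of range. A graph built on the train split alone would pass it while still leaving the test rows without any edges.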