shahsohil / DCC

This repository contains the source code and data for reproducing results of Deep Continuous Clustering paper
MIT License

Ambiguity surrounding pretrained.mat #8

Closed scottfleming closed 5 years ago

scottfleming commented 5 years ago

It's unclear to me from the documentation how pretrained.mat is supposed to be generated. pretraining.py takes in data/mydataset/traindata.mat and data/mydataset/testdata.mat and spits out data/mydataset/results/checkpoint_4.pth.tar, and when extract_feature.py takes in checkpoint_4.pth.tar it spits out a feature matrix with n_train + n_test rows. But are we then supposed to run RCC's edgeConstruction module on traindata.mat, on testdata.mat, or on a combination of the two in order to produce pretrained.mat? If we run it on just one of them and then feed the resulting graph into copyGraph.py, it throws a shape mismatch error...

shahsohil commented 5 years ago

@scottfleming Sorry for the confusion. The distinction between train set and test set matters only for the pretraining phase. For end-to-end training, DCC runs on the complete dataset: in a fully unsupervised scenario (which is the case here) there is no separate training and inference phase. Hence the graph construction should be done on the complete dataset. Please maintain the same ordering of the data throughout, i.e., [trainset, testset]. https://github.com/shahsohil/DCC/blob/63f3851ca970ef656d1a2d58f825454cd3ab7681/pytorch/extract_feature.py#L69
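To make the ordering concrete, something like this (illustrative only, not code from the repo; the function name and toy shapes are made up) is what "run graph construction on the complete dataset, ordered [trainset, testset]" amounts to:

```python
import numpy as np

def stack_for_graph(train_X, test_X):
    """Stack features in the fixed [trainset, testset] order so that
    row i refers to the same sample in the graph, the embeddings, and
    the raw features. Names and shapes here are illustrative."""
    assert train_X.shape[1] == test_X.shape[1], "feature dims must match"
    return np.concatenate([train_X, test_X], axis=0)

# Toy example mirroring the Reuters split sizes (~8500 train, ~1500 test):
train_X = np.zeros((8500, 4))
test_X = np.ones((1500, 4))
full_X = stack_for_graph(train_X, test_X)
assert full_X.shape == (10000, 4)  # n_train + n_test rows, train rows first
```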

scottfleming commented 5 years ago

Ah, I see, this is making more sense. It's unclear to me then when you would actually use the train/test split. In your Reuters example, I'm assuming you trained the SDAE on the train set (~8500 articles), used extract_feature.py to get the embeddings for both the train set (~8500 articles) and the test set (~1500 articles), but then ran edgeConstruction.py on just the training set?

Are you suggesting that the split into train and test in this case is just to provide a validation loss for the SDAE training to ensure that we're not overfitting, for example?

Ultimately, I'm just trying to reproduce your published Reuters results in my local environment, but there are a few spots where I'm struggling to fill in the gaps. For example, in the edgeConstruction.py file you mention that "PCA is computed for Text dataset. Please refer RCC paper for exact details". But I don't see anything in the RCC paper about Text-dataset-specific PCA preprocessing. Am I missing something obvious?

Thanks for all your help!

shahsohil commented 5 years ago

Hi @scottfleming,

edgeConstruction.py is applied on the complete dataset. Let me clarify the steps:

  1. Train the SDAE on the train split. The validation set is used only as a check for overfitting / early stopping.
  2. In parallel (completely independent of the step above), construct the topology graph using the edgeConstruction.py code. The data for this is ordered as [trainset, testset]; the test set is the same as the validation set. This arrangement has no effect on the graph construction and is done only to maintain the same ordering across all steps (here).
  3. Use extract_feature.py to extract all the embeddings. Again, the embeddings are arranged as [trainset, testset] inside the code.
  4. Next, use copyGraph.py to merge (raw_features, embeddings, graph, cluster_labels) into a single file.
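All four steps depend on every artifact agreeing on the same [trainset, testset] ordering, which is exactly what the shape mismatch error in copyGraph.py is guarding. A quick sanity check before the merge (argument names are illustrative, not the repo's API) could look like:

```python
import numpy as np

def check_merge_inputs(raw_features, embeddings, graph_edges, n_train, n_test):
    """Verify every artifact covers the full [trainset, testset] dataset
    before merging. graph_edges is assumed to be an (E, 2) array of node
    indices produced by graph construction; names are illustrative."""
    n = n_train + n_test
    if raw_features.shape[0] != n:
        raise ValueError(f"raw features have {raw_features.shape[0]} rows, expected {n}")
    if embeddings.shape[0] != n:
        raise ValueError(f"embeddings have {embeddings.shape[0]} rows, expected {n}")
    if graph_edges.size and graph_edges.max() >= n:
        raise ValueError("graph references a node outside the full dataset")
    return True
```

Running the graph construction on only one split makes the first or third check fail, which is the shape mismatch reported above.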

Regarding the details of the PCA, they are given under the Datasets section in the supplementary material of the RCC paper.
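For reference, plain PCA via SVD looks like the following. This is a generic sketch, not the repo's preprocessing code, and the target dimensionality must be taken from the RCC supplementary rather than from this snippet:

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X onto its top principal components via SVD.
    A generic stand-in for the text-dataset PCA preprocessing the
    edgeConstruction.py comment refers to; n_components is a placeholder."""
    Xc = X - X.mean(axis=0, keepdims=True)  # center the data
    # Economy-size SVD of the centered data; principal axes are rows of Vt
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T
```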