shahsohil / DCC

This repository contains the source code and data for reproducing the results of the Deep Continuous Clustering paper.
MIT License

a question about mknn #18

Closed jiankang1991 closed 5 years ago

jiankang1991 commented 5 years ago

Hi @shahsohil, this work is very interesting! I have a question about the construction of the mkNN graph. In the project, I see that you use the original data to measure similarity and construct the mkNN graph. Is there a particular reason for this? Why not use the latent representation from the pretrained AE to build the graph? If I have large image patches, e.g. 256x256 with multiple input bands, constructing the graph in the original image space will be computationally expensive.

Thank you.

shahsohil commented 5 years ago

Hi @jiankang1991, good question. When an AE is pretrained with only a reconstruction loss, it is highly unlikely that the underlying topology is preserved in the latent code space. In other words, there is no guarantee that the AE features provide a much better clustering prior than the original raw features. In my experiments on most of the datasets discussed in the paper, I found that using AE features for graph construction impaired the final quantitative results.

However, in your case, given the large feature size, I agree that it makes more sense to use latent features for graph construction rather than the original features. Moreover, the curse of dimensionality will be largely at play in this case.
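
For readers following along, the idea of building the mutual k-nearest-neighbour graph on latent features can be sketched roughly as below. This is a minimal illustration and not the repository's own graph-construction code: the function name `mutual_knn_edges` is hypothetical, Euclidean distance is an assumption, and the dense distance matrix only suits small inputs (a dataset of 27000 points would need an approximate or batched neighbour search).

```python
import numpy as np

def mutual_knn_edges(features, k):
    """Return (i, j) pairs with i < j where i and j are each among the
    other's k nearest neighbours (Euclidean distance, an assumption)."""
    x = np.asarray(features, dtype=np.float64)
    n = x.shape[0]
    # Dense pairwise squared distances: O(n^2) memory, sketch only.
    sq = np.sum(x * x, axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
    # Column indices of the k nearest neighbours of every point.
    nn = np.argsort(d, axis=1)[:, :k]
    is_nn = np.zeros((n, n), dtype=bool)
    is_nn[np.arange(n)[:, None], nn] = True
    # Keep an edge only when the neighbour relation holds in both directions.
    mutual = is_nn & is_nn.T
    i, j = np.nonzero(np.triu(mutual, k=1))
    return list(zip(i.tolist(), j.tolist()))

# Toy stand-in for AE latent codes: five 1-D points.
latent = np.array([[0.0], [1.0], [3.0], [10.0], [11.0]])
edges = mutual_knn_edges(latent, k=1)
```

In the toy example the point at 3.0 ends up with no edge, because its nearest neighbour (1.0) is closer to 0.0. Isolated points like this become more common as `k` shrinks, which is one reason the choice of `k` matters.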

jiankang1991 commented 5 years ago

Thank you for your reply. Besides the number of neighbours, which hyperparameters should be tuned when the method is applied to a new image dataset? When I run it on my dataset, there are still about a thousand clusters after about 500 epochs. I normalize the dataset to the range 0 to 1. Do you have any other suggestions?

Thank you very much.

shahsohil commented 5 years ago

The major hyperparameter is the number of k-NN neighbours. Try increasing 'k'. Also normalise the dataset to the range [-1, 1]. Can you also share a plot showing the number of clusters w.r.t. training epochs?
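
The suggested rescaling to [-1, 1] could look something like the sketch below. The helper name `rescale` is hypothetical, and using a single global minimum and maximum over the whole array (rather than per-feature or per-band statistics) is an assumption; the thread does not say which variant is intended.

```python
import numpy as np

def rescale(data, lo=-1.0, hi=1.0):
    """Linearly map data so its global min becomes lo and its global max hi."""
    x = np.asarray(data, dtype=np.float64)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        raise ValueError("constant data cannot be rescaled")
    return lo + (x - x_min) * (hi - lo) / (x_max - x_min)

# Data already normalised to [0, 1] is shifted and stretched to [-1, 1].
scaled = rescale([0.0, 0.5, 1.0])
```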

jiankang1991 commented 5 years ago

I did not save it. As I remember, I have 27000 images, and the number of clusters decreases gradually from about 26990 in the first epoch to about 1000 at epoch 500. Each epoch, the number of clusters decreases by about 20-100. The ground-truth number of clusters is 10.

jiankang1991 commented 5 years ago

The value of k is 10.