shahsohil / DCC

This repository contains the source code and data for reproducing the results of the Deep Continuous Clustering paper.
MIT License · 208 stars · 53 forks

Clustering problems #25

Open silvia1993 opened 4 years ago

silvia1993 commented 4 years ago

Hello, thank you very much for sharing your project!

I'm trying to apply this algorithm to a set of RGB images (cartoons): 2344 samples of dimension [227,227,3] spanning 7 classes. The algorithm is not able to cluster the images correctly; at the end I get ~0.2 ACC with 1220 clusters. I carefully read all the issues resolved in this repository but could not solve my problem, so I list each step I performed below, hoping for feedback on a possible mistake:

  1. I made my dataset with "make_data.py", normalizing to [-1,1]. At the end I have testdata.mat and traindata.mat. Each row in these matrices is the concatenation of the three channels, i.e. [R,G,B] -> [51529,51529,51529] (51529 = 227x227). Taking testdata.mat and traindata.mat together, I have a 2344x154587 matrix.

  2. Next I run "pretraining.py" with --batch_size=256, --niter=1831 (to get 200 epochs, as suggested), --step=733 (to get 80 epochs, as suggested), --lr=0.01 (since the dimensionality of my samples is higher than that of the other datasets used with this framework, I thought this could be a good choice), and --dim=10.

  3. With the checkpoint_4.pth.tar file obtained in step 2, I extract the features of the dataset, obtaining "pretrained.pkl".

  4. I construct the graph on the original data using "edge_construction.py" with --algo knn, --k 10, and --samples 2344, and I get the "pretrained.mat" file.

  5. Then I launch "copyGraph.py" to produce the final "pretrained.mat" file.

  6. Finally, I run "DCC.py", leaving all the default values.
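For reference, steps 1 and 2 above can be sketched as follows. This is a toy NumPy reconstruction of my setup, not the repository's code: the [-1,1] scaling formula and the assumption that --niter counts minibatches (so iterations ≈ samples × epochs / batch_size) are my own reading, and the random image stack is a stand-in for my real data.

```python
import numpy as np

# Hypothetical stand-in: a few RGB images of 227x227 (my real set has 2344).
images = np.random.randint(0, 256, size=(4, 227, 227, 3), dtype=np.uint8)

# Step 1: scale to [-1, 1] and flatten each image as [R, G, B], i.e. the
# three channel planes concatenated: 3 * 227 * 227 = 154587 columns.
x = images.astype(np.float32) / 127.5 - 1.0
flat = x.transpose(0, 3, 1, 2).reshape(len(x), -1)
print(flat.shape)  # (4, 154587)

# Step 2: epochs -> iterations, assuming niter counts minibatches.
samples, batch_size = 2344, 256
print(samples * 200 / batch_size)  # 1831.25 -> I used --niter=1831
print(samples * 80 / batch_size)   # 732.5   -> I used --step=733
```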

I also tried a higher k (k=20) and mknn instead of knn, but things do not seem to change. Do you have any idea why the algorithm does not work properly with my data?
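For clarity, the difference between the two graph options I tried can be sketched like this. It's a toy NumPy implementation, not the repository's "edge_construction.py": mknn keeps an edge only when it is reciprocated by both endpoints, so it produces a sparser, more conservative graph than plain knn.

```python
import numpy as np

def knn_edges(X, k):
    """Directed kNN edges: (i, j) for each of i's k nearest neighbors j."""
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    np.fill_diagonal(d, np.inf)                         # exclude self-edges
    nbrs = np.argsort(d, axis=1)[:, :k]
    return {(i, int(j)) for i in range(len(X)) for j in nbrs[i]}

def mknn_edges(X, k):
    """Mutual kNN: keep (i, j) only if i and j each list the other."""
    e = knn_edges(X, k)
    return {(i, j) for (i, j) in e if (j, i) in e}
```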

sumeromer commented 4 years ago

@silvia1993 : I have a similar question, indeed. I am searching for a better architecture to use with the DCC losses, because all the datasets (MNIST, YTF, Coil100, and YaleB) are toy datasets, and the current fully connected and convolutional architectures will not be enough for 227x227 RGB images.
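To make the depth issue concrete: with the standard convolution output-size formula, a 227x227 input needs about five stride-2 convolutions before its spatial map is as small as what a shallow encoder reaches on 28x28 MNIST in two layers. The kernel/stride/padding values below are illustrative, not the repository's.

```python
def conv_out(n, k, s, p):
    """Spatial size after a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# Repeatedly apply an illustrative 5x5, stride-2, padding-2 convolution.
n = 227
sizes = [n]
while n > 8:
    n = conv_out(n, k=5, s=2, p=2)
    sizes.append(n)
print(sizes)  # [227, 114, 57, 29, 15, 8]
```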

@shahsohil : Do you have any recommendations for ImageNet-like images? Did you experiment with them?

ilyak93 commented 3 years ago

@sumeromer, @silvia1993 I had a somewhat similar problem: I always got one very dominant cluster containing most of the data, plus a lot of singleton clusters or clusters with only a few examples. Did that happen to you?

shsaronian commented 1 year ago

> @sumeromer, @silvia1993 I had a somewhat similar problem: I always got one very dominant cluster containing most of the data, plus a lot of singleton clusters or clusters with only a few examples. Did that happen to you?

It also happened to me; it's as if the model overfits to all the data points and clusters them in a single group. I don't know if that framing makes sense, since overfitting is a term mostly used for supervised algorithms.
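A quick way to spot this collapse is to print the cluster-size histogram of the predicted assignments. Here `labels` is a hypothetical output illustrating the failure mode (one dominant cluster plus many singletons); in practice you'd use whatever label vector DCC produces.

```python
from collections import Counter

# Hypothetical assignment: 2000 points in cluster 0, then 344 singletons.
labels = [0] * 2000 + list(range(1, 345))

sizes = Counter(labels)
largest = max(sizes.values())
singletons = sum(1 for c in sizes.values() if c == 1)
print(f"clusters={len(sizes)} largest={largest} singletons={singletons}")
# clusters=345 largest=2000 singletons=344
```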