shahsohil / DCC

This repository contains the source code and data for reproducing the results of the Deep Continuous Clustering paper
MIT License

can we train and see the clusters with unlabeled data? #1

Closed tiru1930 closed 6 years ago

tiru1930 commented 6 years ago

Hi

I am currently working on clustering my custom text data, for which I don't have any predefined labels. I tried the available code with a few changes so that labels are not considered when building the training input, but it gave the error below.

    File "pretraining.py", line 83, in main
      'nepoch':nepoch, 'lrate':[args.lr], 'wdecay':[0.0], 'step':step}, use_cuda, trainloader, testloader)
    File "pretraining.py", line 171, in pretrain
      train(trainloader, net, index, optimizer, epoch, use_cuda)
    File "pretraining.py", line 192, in train
      outputs = net(inputs_Var, index)
    File "/home/tiru/Desktop/topicmodel/topicmodel/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 357, in __call__
      result = self.forward(*input, **kwargs)
    File "/home/tiru/Desktop/topicmodel/DCC/pytorch/SDAE.py", line 34, in forward
      inp = x.view(-1, self.in_dim)
    RuntimeError: invalid argument 2: size '[-1 x 2000]' is invalid for input with 324864 elements at /pytorch/torch/lib/TH/THStorage.c:37

So is it possible to train on text data that has no labels? If yes, how can we train and test on it?

shahsohil commented 6 years ago

First of all, the clustering algorithm and its module in the code do not require any label input. Please see https://github.com/shahsohil/DCC/blob/00d420d41e5f9de38d8defde5820e77e4ce914b2/pytorch/DCC.py#L246-L248. As you can see at https://github.com/shahsohil/DCC/blob/00d420d41e5f9de38d8defde5820e77e4ce914b2/pytorch/DCC.py#L290-L291, labels are used only for evaluation, and only during the last few epochs of the total DCC training.

Secondly, the DCC code can complete a run even without label input and without modifying a single line of code. Just fill in dummy labels (e.g., all zeros, all ones, or random values) for the target variable in the input file pretrained.mat.
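The dummy-label suggestion above could be sketched roughly as follows. This is only an illustration, not code from the repository: the helper names `dummy_labels` and `save_with_dummy_labels`, the `n_classes` parameter, and the `.mat` key names `'X'` and `'Y'` are all assumptions, so check the field names your own input file actually uses.

```python
import random


def dummy_labels(n_samples, mode="zeros", n_classes=10, seed=0):
    """Build one placeholder label per sample (hypothetical helper)."""
    if mode == "zeros":
        return [0] * n_samples
    if mode == "ones":
        return [1] * n_samples
    if mode == "random":
        rng = random.Random(seed)  # seeded so reruns give the same labels
        return [rng.randrange(n_classes) for _ in range(n_samples)]
    raise ValueError("mode must be 'zeros', 'ones' or 'random'")


def save_with_dummy_labels(path, features, mode="zeros"):
    """Write features plus placeholder labels to a .mat file.

    NumPy and SciPy are imported lazily so dummy_labels() works without them.
    ASSUMPTION: the key names 'X' (features) and 'Y' (labels) are illustrative.
    """
    import numpy as np
    import scipy.io as sio

    X = np.asarray(features, dtype=np.float32)
    Y = np.asarray(dummy_labels(X.shape[0], mode))
    sio.savemat(path, {"X": X, "Y": Y})


zeros = dummy_labels(4)
ones = dummy_labels(3, mode="ones")
rand = dummy_labels(5, mode="random")
```

Since the labels are only placeholders, any evaluation metrics computed against them carry no meaning.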

Finally, regarding the error you are seeing: it seems to me that your input has dimension 1,269 instead of 2,000. Please verify your input file.
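One way to arrive at the 1,269 figure is from the element count in the error message. The sketch below assumes a batch size of 256, which is not stated anywhere in the thread; `infer_feature_dim` is a hypothetical helper written only for this illustration.

```python
def infer_feature_dim(total_elements, batch_size):
    """Return the per-sample dimension, or None if it does not divide evenly."""
    if total_elements % batch_size != 0:
        return None
    return total_elements // batch_size


total = 324864       # element count reported in the RuntimeError
expected_dim = 2000  # in_dim requested by x.view(-1, self.in_dim)
batch_size = 256     # ASSUMPTION: batch size, not stated in the thread

# view(-1, 2000) needs the element count to be a multiple of 2000
fits_expected = total % expected_dim == 0
actual_dim = infer_feature_dim(total, batch_size)

print(fits_expected, actual_dim)  # False 1269
```

Under that assumption, each sample carries 1,269 features, so either the input has to be rebuilt with 2,000 features or the network's input dimension has to be set to match the data.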

tiru1930 commented 6 years ago

Thank you.