rashadulrakib / short-text-clustering-enhancement


Problems to reproduce the results #2

Open gabrielsantosrv opened 4 years ago

gabrielsantosrv commented 4 years ago

Hello @rashadulrakib. First of all, thanks for your work; I'm really interested in it.

The file search_snippets_pred contains labels that haven't been defined in search_snippets_true_text. Could you please generate the file search_snippets_pred correctly and update it?

Moreover, I'm having trouble reproducing your reported results. I got the following scores: StackOverflow dataset: acc (%): 68.15, nmi (%): 66.78

Biomedical dataset: acc (%): 46.77, nmi(%): 38.82

However, the reported results are: StackOverflow dataset: acc (%): 78.73±0.17, nmi(%): 73.44±0.35

Biomedical dataset: acc (%): 47.78±0.51, nmi(%): 41.27±0.36

PS. I'm executing the code using the data you have provided in the /data directory.

Could you please help me to reproduce your results? Thanks

rashadulrakib commented 4 years ago

I resolved the problem for the search_snippet dataset. Simply run main.py.

rashadulrakib commented 4 years ago

Hello,

I resolved the problem for the search_snippet dataset. The initial labels are generated by a naive clustering algorithm such as k-means: https://github.com/rashadulrakib/short-text-clustering-enhancement/commit/037e558d8841ed2a17e5b00354d9996ce672c661

Thanks for your interest.


gabrielsantosrv commented 4 years ago

Thanks for your reply!

Since the initial clustering provided is k-means, while your paper https://arxiv.org/pdf/2001.11631.pdf uses agglomerative clustering with similarity-distribution-based sparsification, how do I reproduce the HAC_SD_IC approach?

rashadulrakib commented 4 years ago

Hello,

It would be nice if I could provide you the code for HAC_SD_IC, or at least the results of the algorithm, but my code for HAC_SD_IC is spread across three different languages; I am sorry for that. The steps are: represent each text by its average GloVe (300d) word vector; build an n-by-n text similarity matrix; sparsify it using Algorithm 2; then perform HAC on the sparsified matrix to get the cluster labels.

Or simply run HAC on the n-by-n similarity matrix without sparsification; that will also give you a competitive result.
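The steps above can be sketched in Python. This is a minimal illustration, not the authors' code: it uses cosine similarity over (pre-averaged) text vectors, and a simple top-k row sparsifier as a stand-in for the paper's Algorithm 2, which actually selects links by a similarity-distribution criterion. HAC is run with average linkage here since ward cannot take a precomputed matrix in scipy.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def sparsify_topk(sim, k):
    """Keep only each row's k largest off-diagonal similarities; zero the
    rest. Symmetrize with max() so the result is a valid similarity matrix.
    (Stand-in for the paper's similarity-distribution-based Algorithm 2.)"""
    n = sim.shape[0]
    out = np.zeros_like(sim)
    for i in range(n):
        order = np.argsort(sim[i])[::-1]        # most similar first
        keep = order[order != i][:k]            # skip self-similarity
        out[i, keep] = sim[i, keep]
    return np.maximum(out, out.T)

def hac_on_similarity(sim, n_clusters):
    """HAC on a similarity matrix: convert to distances, condense, link,
    then cut the dendrogram into a fixed number of flat clusters."""
    dist = np.clip(1.0 - sim, 0.0, None)
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method='average')
    return fcluster(Z, t=n_clusters, criterion='maxclust')
```

With averaged GloVe text vectors normalized to unit length, `X @ X.T` gives the cosine-similarity matrix to feed into `sparsify_topk` and `hac_on_similarity`.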

I will try to run HAC_SD_IC if I can.

Thanks a lot for your interest.


gabrielsantosrv commented 4 years ago

Hello, Thanks a lot for your reply, it helped me understand how to run HAC.

I just have another question regarding HAC_SD. Should I run HAC_SD on the similarity matrix to generate an initial clustering, and only then execute the iterative classification to improve it?

gabrielsantosrv commented 4 years ago

Do the reported results consider the entire dataset, or is it split into train/test sets? That is, did you split into train/test sets before the initial clustering and report your results on the test sets?

rashadulrakib commented 4 years ago

Yes, you are right: run HAC_SD on the similarity matrix to generate an initial clustering, and only then execute the iterative classification to improve it.
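The enhancement step trains on cluster-internal "non-outlier" texts and relabels the rest. A rough numpy sketch of that loop, using nearest-centroid assignment as a stand-in for the real classifier the paper trains (so this is an illustration of the idea, not the authors' implementation):

```python
import numpy as np

def iterative_classification(X, labels, n_iter=5, keep_frac=0.8):
    """Hypothetical sketch of iterative cluster refinement:
    (1) compute cluster centroids from the current labels,
    (2) keep only each cluster's most central points (drop likely outliers),
    (3) recompute centroids on the kept points and re-assign every point
        to its nearest centroid; repeat."""
    labels = labels.copy()
    for _ in range(n_iter):
        ids = np.unique(labels)
        cents = np.vstack([X[labels == c].mean(axis=0) for c in ids])
        # (2) keep the keep_frac most central members of each cluster
        mask = np.zeros(len(X), dtype=bool)
        for j, c in enumerate(ids):
            members = np.where(labels == c)[0]
            d = np.linalg.norm(X[members] - cents[j], axis=1)
            mask[members[np.argsort(d)[: max(1, int(keep_frac * len(members)))]]] = True
        # (3) retrain centroids on kept points, then relabel all points
        cents = np.vstack([X[mask & (labels == c)].mean(axis=0) for c in ids])
        labels = ids[np.argmin(np.linalg.norm(X[:, None] - cents[None], axis=2), axis=1)]
    return labels
```

Swapping the nearest-centroid step for an actual classifier (e.g. logistic regression trained on the kept points) recovers the classification flavor of the paper's procedure.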


rashadulrakib commented 4 years ago

On the entire dataset


gabrielsantosrv commented 4 years ago

Hello,

Again thanks a lot for your replies and willingness to help me.

Which criterion have you used to form the clusters from the dendrogram returned by the ward algorithm?

rashadulrakib commented 4 years ago

Hello,

I used method='ward.D2' for the hierarchical clustering, via the fastcluster package in R: https://cran.r-project.org/web/packages/fastcluster/vignettes/fastcluster.pdf. Did I answer your question?
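For a Python equivalent: fastcluster also ships a Python interface whose `linkage` is a drop-in replacement for scipy's, and scipy's method='ward' on raw feature vectors corresponds to R's ward.D2 on Euclidean distances. A minimal sketch, assuming the usual criterion of cutting the dendrogram into the known number of clusters (that assumption is mine, not stated in the thread):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
# fastcluster.linkage(X, method='ward') would work identically, if installed

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.05, (10, 4)),   # toy "cluster 1" texts
               rng.normal(1.0, 0.05, (10, 4))])  # toy "cluster 2" texts

Z = linkage(X, method='ward')                    # scipy 'ward' ~ R ward.D2
labels = fcluster(Z, t=2, criterion='maxclust')  # cut into 2 flat clusters
```

`fcluster` with criterion='maxclust' is what turns the dendrogram into flat cluster labels once the target number of clusters is fixed.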


gabrielsantosrv commented 4 years ago

Sure, I got it.

gabrielsantosrv commented 4 years ago

Hello, I'm trying to reproduce the HAC_SD algorithm, and while extracting the GloVe embeddings of the texts from the StackOverflow dataset, I noticed that some words are misspelled, such as "featureactivated", "oraole", "navgation", etc. How do you deal with these cases?

I just ignored these words after removing the stop words, but that can leave empty strings, which I removed because I didn't know what else to do with them.
After running HAC_SD on the average of the GloVe vectors of each text's non-stop words, using the ward clusterer from the fastcluster package in Python, I got the following scores: acc: 0.56935, nmi: 0.49943
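The OOV handling described above (skip out-of-vocabulary tokens, drop texts that end up empty) can be made explicit in a small helper. The `emb` table here is a toy stand-in for the real GloVe dictionary:

```python
import numpy as np

def avg_embedding(tokens, emb):
    """Average the vectors of in-vocabulary tokens; out-of-vocabulary
    tokens (misspellings like 'oraole') are simply skipped. Returns
    None when every token is OOV, so the caller can drop the text."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else None

# toy 2-d "GloVe" table standing in for the real 300-d one
emb = {"sql": np.array([1.0, 0.0]), "query": np.array([0.0, 1.0])}
```

An alternative to dropping all-OOV texts is to assign them a zero vector, at the cost of lumping them near the origin in similarity computations.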

Do you have any suggestions to reach your results? Your results reported on Improving Short Text Clustering by Similarity Matrix Sparsification (https://dl.acm.org/doi/pdf/10.1145/3209280.3229114?download=true) were acc: 0.6480 nmi: 0.5948

Thanks!

rashadulrakib commented 4 years ago

Hello,

Sorry for the late reply. You can try to enhance your result (acc: 0.56935, nmi: 0.49943) through iterative classification. I'm sorry I cannot answer your question now, as I developed this a long time ago.

Can you please tell me which university you are from?


gabrielsantosrv commented 4 years ago

Hello,

It's ok ;) I'm just beginning to study short-text clustering, so I would like to reproduce your results as part of a state-of-the-art review; since they are quite good, they caught my attention.

By the way, I'm from the University of Campinas (Unicamp)