gabrielsantosrv opened this issue 4 years ago
Resolved the problem for the search_snippet dataset; simply run main.py.
Hello,
I resolved the problem for the search_snippet dataset. The initial labels are generated by a naive clustering algorithm such as k-means: https://github.com/rashadulrakib/short-text-clustering-enhancement/commit/037e558d8841ed2a17e5b00354d9996ce672c661
Thanks for your interest.
Thanks for your reply!
Since the initial clustering provided is k-means, while your paper https://arxiv.org/pdf/2001.11631.pdf considers Agglomerative Clustering together with similarity-distribution-based sparsification, how do I reproduce the HAC_SD_IC approach?
Hello,
It would be nice if I could provide you the code for HAC_SD_IC or the results of the algorithm, but my code for HAC_SD_IC is in three different languages; I am sorry for that. The steps are: represent each text by its average GloVe (300d) vector; create an n-by-n text similarity matrix; sparsify it using Algorithm 2; then perform HAC on the sparsified matrix to get the clustering labels.
Or, more simply, just run HAC on the n-by-n similarity matrix; that will also give you a competitive result.
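For concreteness, here is a minimal Python sketch of those steps. The tiny embedding dict stands in for a real GloVe 300d lookup, and the crude keep-top-s-per-row filter stands in for the paper's Algorithm 2; both are illustrative assumptions, not the original implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

toy_glove = {  # stand-in for a real GloVe (300d) lookup table
    "java": [1.0, 0.1], "python": [0.9, 0.2], "code": [0.8, 0.15],
    "gene": [0.1, 1.0], "protein": [0.2, 0.9], "cell": [0.15, 0.95],
}

def embed(text):
    # represent a text by the average vector of its in-vocabulary words
    vecs = [toy_glove[w] for w in text.split() if w in toy_glove]
    return np.mean(vecs, axis=0)

texts = ["java code", "python code", "gene protein", "protein cell"]
X = np.array([embed(t) for t in texts])

# n-by-n cosine similarity matrix
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
S = Xn @ Xn.T

# crude sparsification: keep only each row's s largest similarities
s = 2
keep = np.argsort(-S, axis=1)[:, :s]
S_sparse = np.zeros_like(S)
for i, cols in enumerate(keep):
    S_sparse[i, cols] = S[i, cols]
S_sparse = np.maximum(S_sparse, S_sparse.T)  # restore symmetry

# HAC expects a condensed distance matrix; convert similarity -> distance
D = 1.0 - S_sparse
np.fill_diagonal(D, 0.0)
Z = linkage(squareform(D, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

With this toy data the two programming texts land in one cluster and the two biomedical texts in the other.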
I will try to run HAC_SD_IC if I can.
Thanks a lot for your interest.
Hello, Thanks a lot for your reply, it helped me understand how to run HAC.
I just have another question regarding HAC_SD. Should I run HAC_SD on the similarity matrix to generate an initial clustering, and only then execute the iterative classification to improve it?
Do the reported results consider the entire dataset, or is it split into train/test sets? That is, did you split into train/test sets before the initial clustering and report your results on the test sets?
> Should I run HAC_SD on the similarity matrix to generate an initial clustering and only then execute the iterative classification to improve it?

Yes, you are right.
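A rough sketch of what the iterative-classification stage could look like in Python. A simple keep-the-half-of-each-cluster-closest-to-its-centroid heuristic stands in for the paper's outlier-removal step, and the classifier choice is illustrative, not the original code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# toy 2-D "embeddings": two well-separated blobs
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
true = np.array([0] * 30 + [1] * 30)
labels = true.copy()
labels[:5] = 1  # corrupt a few initial cluster labels on purpose

for _ in range(3):  # a few classification iterations
    train_idx = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        centroid = X[members].mean(axis=0)
        d = np.linalg.norm(X[members] - centroid, axis=1)
        # keep the half of each cluster closest to its centroid as "clean" training data
        train_idx.extend(members[np.argsort(d)[: len(members) // 2]])
    clf = LogisticRegression().fit(X[train_idx], labels[train_idx])
    labels = clf.predict(X)  # relabel every text with the trained classifier

print((labels == true).mean())  # agreement with the true labels
```

The idea is that training only on the most confident members of each cluster lets the classifier correct the mislabeled points when everything is reclassified.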
On the entire dataset
Hello,
Again thanks a lot for your replies and willingness to help me.
Which criterion did you use to form the clusters from the dendrogram returned by the Ward algorithm?
Hello,
I used method='ward.D2' for the hierarchical clustering, via the fastcluster package in R: https://cran.r-project.org/web/packages/fastcluster/vignettes/fastcluster.pdf. Did I answer your question?
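For readers working in Python rather than R: scipy (and the Python fastcluster package, which shares the same linkage API) exposes roughly the equivalent of ward.D2 as method='ward' on the raw observation vectors, and the dendrogram is commonly cut into a fixed number of clusters with fcluster's maxclust criterion. A small sketch on toy data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
# toy stand-in for averaged GloVe vectors: two well-separated blobs
X = np.vstack([rng.normal(0, 0.2, (20, 5)), rng.normal(2, 0.2, (20, 5))])

Z = linkage(X, method="ward")                    # Ward merge criterion
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(np.unique(labels))
```

With criterion='maxclust' the tree is cut so that at most t flat clusters remain, which matches the usual "we know the number of classes" evaluation setup for these datasets.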
Sure, I got it.
Hello, I'm trying to reproduce the HAC_SD algorithm, and I noticed, while extracting the GloVe embeddings of the texts from the StackOverflow dataset, that some words are misspelled, such as "featureactivated", "oraole", "navgation", etc. How do you deal with these cases?
I just ignored these words after removing the stop words, but that can leave empty strings, which I removed because I didn't know what else to do with them.
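One common way to handle this (an assumption about reasonable practice, not necessarily what the author did) is to skip out-of-vocabulary tokens and fall back to a zero vector when a text ends up with no in-vocabulary words, so every text keeps a row in the similarity matrix:

```python
import numpy as np

# toy lookup standing in for a real GloVe table
embeddings = {"graph": np.array([1.0, 0.0]), "search": np.array([0.0, 1.0])}
dim = 2
stop_words = {"the", "a", "how", "to"}

def text_vector(text):
    toks = [w for w in text.lower().split() if w not in stop_words]
    vecs = [embeddings[w] for w in toks if w in embeddings]  # skip OOV/misspelled words
    if not vecs:               # text became empty after filtering:
        return np.zeros(dim)   # fall back to a zero vector (dropping the text is the alternative)
    return np.mean(vecs, axis=0)

print(text_vector("how to search a graph"))  # average of the in-vocab words
print(text_vector("oraole navgation"))       # all OOV -> zero vector
```

Note that zero vectors make cosine similarity ill-defined for those rows, which is presumably why dropping empty texts (as described above) is also a defensible choice.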
After running HAC_SD on the average of the GloVe vectors of the non-stop words of each text, using the Ward clusterer from the fastcluster package for Python, I got the following scores:
acc: 0.56935
nmi: 0.49943
Do you have any suggestions for reaching your results? The results reported in Improving Short Text Clustering by Similarity Matrix Sparsification (https://dl.acm.org/doi/pdf/10.1145/3209280.3229114?download=true) were acc: 0.6480, nmi: 0.5948.
Thanks!
Hello,
Sorry for the late reply. You can try to improve your result (acc: 0.56935, nmi: 0.49943) through iterative classification. Sorry, I cannot answer your question in more detail now, as I developed this a long time ago.
Can you please tell me which university you are from?
Hello,
It's ok ;) I'm just beginning to study short-text clustering, so I would like to reproduce your results as part of a state-of-the-art review; they are quite good, so they caught my attention.
By the way, I'm from the University of Campinas (Unicamp)
Hello @rashadulrakib, First of all, thanks for your work, I'm really interested in it.
The file search_snippets_pred contains labels that haven't been defined in search_snippets_true_text. Could you please generate the file search_snippets_pred correctly and update it?
Moreover, I'm having some problems reproducing your reported results. I got these scores: StackOverflow dataset: acc (%): 68.15, nmi (%): 66.78
Biomedical dataset: acc (%): 46.77, nmi(%): 38.82
However, the reported results are: StackOverflow dataset: acc (%): 78.73±0.17, nmi(%): 73.44±0.35
Biomedical dataset: acc (%): 47.78±0.51, nmi(%): 41.27±0.36
PS. I'm executing the code using the data you provided in the /data directory.
Could you please help me to reproduce your results? Thanks
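For reference, the clustering accuracy used in this literature is usually computed by finding the best one-to-one mapping between predicted and true cluster ids with the Hungarian algorithm, alongside NMI. A self-contained sketch (assuming scipy and scikit-learn are available):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def cluster_accuracy(y_true, y_pred):
    """Best-map clustering accuracy via Hungarian matching."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1  # contingency table: rows = predicted, cols = true
    rows, cols = linear_sum_assignment(-cost)  # maximize matched pairs
    return cost[rows, cols].sum() / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]  # the same partition with permuted cluster ids

print(cluster_accuracy(y_true, y_pred))              # 1.0
print(normalized_mutual_info_score(y_true, y_pred))  # 1.0
```

Because both metrics are invariant to relabeling of the clusters, an identical partition under permuted ids scores 1.0 on both; small differences in how the mapping or preprocessing is done can account for part of a gap against reported numbers.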