porterehunley / RACplusplus

A high performance implementation of Reciprocal Agglomerative Clustering in C++
MIT License

About Ward Linkage #2

Open shawn0wang opened 11 months ago

shawn0wang commented 11 months ago

I'm opening this issue to ask when Ward linkage will be ready. I often use Ward linkage because it gives better results than the other linkages.

BTW, with cosine distance, I don't know whether a larger or a smaller max_merge_distance gives a better result. I thought that the larger the cosine distance, the more similar the two sentences are, so if I set max_merge_distance higher there should be more clusters, but that's not what happens.

thanks

porterehunley commented 11 months ago

Ward linkage is our first priority right now. It is a bit more complicated because we need to finish a large refactor before we can add more linkage functions. It will most certainly be ready this week.

> BTW, with cosine distance, I don't know whether a larger or a smaller max_merge_distance gives a better result. I thought that the larger the cosine distance, the more similar the two sentences are, so if I set max_merge_distance higher there should be more clusters, but that's not what happens.

Raising max_merge_distance will decrease the number of clusters, because you are loosening the criterion for a merge, so more pairs qualify and more merges happen. This is the same as distance_threshold in scikit-learn's agglomerative clustering: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
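
To make the effect of the threshold concrete, here is a minimal pure-Python sketch. It is not RAC++'s algorithm or API: it is a naive average-linkage agglomerative loop, and the names `cosine_distance`, `average_linkage`, and `cluster` are hypothetical helpers written only for this illustration. It assumes cosine distance is defined as 1 minus cosine similarity, so a larger distance means the two vectors are less similar, and it assumes a merge is allowed only when the linkage distance is at or below the threshold.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 2 for opposite.

    Assumes neither vector is all zeros.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def average_linkage(c1, c2, points):
    """Mean pairwise cosine distance between two clusters of point indices."""
    total = sum(cosine_distance(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def cluster(points, max_merge_distance):
    """Naive agglomerative clustering (illustrative, O(n^3) or worse).

    Repeatedly merges the closest pair of clusters and stops as soon as
    the closest pair is farther apart than max_merge_distance.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > max_merge_distance:
            break  # no remaining pair satisfies the merge criterion
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy data: two tight pairs of directions plus one vector pointing away.
points = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0), (0.1, 0.99), (-1.0, 0.0)]

for threshold in (0.001, 0.1, 1.0, 2.0):
    print(threshold, len(cluster(points, threshold)))
```

On this toy data the cluster count falls as the threshold rises, which matches the behavior described above: a larger max_merge_distance permits more merges and leaves fewer, larger clusters.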

shawn0wang commented 11 months ago

I'm looking forward to Ward linkage and the other features. Thank you for providing this repository; it is very helpful for me!

shawn0wang commented 11 months ago

Another question: when I have many CPU cores, should batch_size be set larger or smaller to make the computation faster?

For example, if I have 100 CPU cores and 200,000 data points, what batch_size would give the fastest computation?