Open shawn0wang opened 11 months ago
Ward linkage is our first priority right now. It is a bit more complicated because we need to finish a large refactor to be able to add more and more linkage functions. It will most certainly be ready this week.
BTW, when I choose cosine distance, I don't know whether a larger or a smaller max_merge_distance gives the better result.
I thought that the larger the cosine distance, the more similar the two sentences are,
so if I set max_merge_distance higher there should be more clusters, but that is not what happens.
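Raising max_merge_distance will decrease the number of clusters, because you are loosening the criterion for a merge: more pairs qualify, so more merges happen. Note that a larger cosine distance means two sentences are less similar, not more. This works the same way as distance_threshold in scikit-learn's agglomerative clustering: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html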
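A small sketch can check how a merge threshold affects the cluster count under cosine distance. This is not this repository's implementation: `cosine_distance` and `cluster_count` are hypothetical helpers written only for illustration, the clustering is a naive average-linkage loop, and the rule "merge while the closest pair is at most max_merge_distance apart" is an assumption about the threshold semantics.

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def average_linkage(c1, c2):
    """Mean pairwise cosine distance between two clusters of vectors."""
    total = sum(cosine_distance(p, q) for p in c1 for q in c2)
    return total / (len(c1) * len(c2))


def cluster_count(points, max_merge_distance):
    """Naive agglomerative clustering (illustration only).

    Repeatedly merges the closest pair of clusters and stops once the
    closest pair is farther apart than max_merge_distance (assumed rule).
    """
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_merge_distance:
            break  # no remaining pair is close enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return len(clusters)


# Two pairs of nearly parallel vectors, with the pairs roughly orthogonal.
points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]

for threshold in (0.001, 0.1, 1.0):
    print(threshold, cluster_count(points, threshold))
```

On this toy data the counts are 4, 2, and 1 for thresholds 0.001, 0.1, and 1.0: the higher the threshold, the more merges are allowed and the fewer clusters remain.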
I'm looking forward to Ward linkage and the other features. Thank you for providing this repository, it is very helpful for me!
Another question: when I have many CPU cores, should batch_size be set larger or smaller to make the calculation faster?
For example, if I have 100 CPU cores and 200,000 data points, what batch_size will be fastest?
I opened this issue to ask when Ward linkage will be ready ^-^. I often use Ward linkage because it gives better results than the other linkages.
thanks