lbehnke / hierarchical-clustering-java

Implementation of an agglomerative hierarchical clustering algorithm in Java. Different linkage approaches are supported.

performance #6

Closed tyrcho closed 9 years ago

tyrcho commented 9 years ago

When I use this with more than 2000 elements, the performance is really slow (16 seconds for 2000, several minutes for 3000).

It seems the algorithm has O(n^3) complexity.

The best algorithms for this problem, such as SLINK (http://en.wikipedia.org/wiki/Single-linkage_clustering), run in O(n²).

Any chance you would implement the faster version?
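For reference, here is a hedged sketch (not part of this library) of the SLINK pointer-representation algorithm for single linkage, which runs in O(n²) time and O(n) memory. After processing, `pi[i]` and `lambda[i]` encode the dendrogram: point `i` first joins a cluster containing `pi[i]` at merge distance `lambda[i]`. The class and method names here are illustrative, not the library's API.

```java
/**
 * Illustrative sketch of SLINK (single-linkage clustering in O(n^2)).
 * Names are hypothetical; this is not part of hierarchical-clustering-java.
 */
public class Slink {
    /**
     * @param dist  full symmetric distance matrix
     * @param piOut output: pi[i] = index of the point i's cluster merges toward
     * @return lambda: lambda[i] = distance at which that merge happens
     */
    public static double[] cluster(double[][] dist, int[] piOut) {
        int n = dist.length;
        double[] lambda = new double[n];
        double[] m = new double[n]; // scratch: distances from point i to earlier points
        for (int i = 0; i < n; i++) {
            piOut[i] = i;
            lambda[i] = Double.POSITIVE_INFINITY;
            for (int j = 0; j < i; j++) m[j] = dist[i][j];
            for (int j = 0; j < i; j++) {
                if (lambda[j] >= m[j]) {
                    // point i is now the nearest merge target for j
                    m[piOut[j]] = Math.min(m[piOut[j]], lambda[j]);
                    lambda[j] = m[j];
                    piOut[j] = i;
                } else {
                    m[piOut[j]] = Math.min(m[piOut[j]], m[j]);
                }
            }
            // fix up pointers that now bypass their old targets
            for (int j = 0; j < i; j++) {
                if (lambda[j] >= lambda[piOut[j]]) piOut[j] = i;
            }
        }
        return lambda;
    }
}
```

For four points on a line at 0, 1, 10, 11, this merges (0,1) and (2,3) at distance 1 and joins the two pairs at distance 9, using only two O(n) scratch arrays rather than an n×n working structure.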

lbehnke commented 9 years ago

When I looked into issue #9 yesterday, I noticed some suspicious code in ClusterPair.agglomerate(). Creating a cluster name that describes a position in a tree can get very expensive when working with many elements. For debugging this is great, but otherwise it is overkill. I'll try to find an alternative naming strategy.
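One possible alternative (a sketch only; the class name is hypothetical and this is not the strategy the author chose): concatenating child names makes each cluster name O(n) characters, so building all names costs O(n²) work and memory, whereas a plain sequential id keeps every name O(1). The tree position can still be recovered from the child links when needed for debugging.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical cheap naming strategy: a short sequential id per cluster
 * instead of a name built by concatenating the children's names
 * (which grows quadratically over a full agglomeration).
 */
public class ClusterNames {
    private static final AtomicInteger counter = new AtomicInteger();

    static String nextName() {
        // O(1) per cluster, regardless of tree depth
        return "clstr#" + counter.incrementAndGet();
    }
}
```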

barrybecker4 commented 5 years ago

@lbehnke, I think that this issue should be reopened. I am seeing the same problem as originally reported by @tyrcho. In fact, running with more than 3000 elements gives an error like:

    An exception or error caused a run to abort: GC overhead limit exceeded
    java.lang.OutOfMemoryError: GC overhead limit exceeded
        at com.apporiented.algorithm.clustering.DistanceMap.hashCodePairNames(DistanceMap.java:107)
        at com.apporiented.algorithm.clustering.DistanceMap.hashCodePair(DistanceMap.java:100)

I did some performance tests, and it looks like the times increase faster than n^3 as n gets larger. In the image below, I bracket the actual timing results between n^2 and (n-1)^3 + 3.

[figure: h-cluster — measured clustering times plotted against n^2 and (n-1)^3 + 3]

I am using:

    val alg = new PDistClusteringAlgorithm
    val cluster = alg.performClustering(distances, words, new CompleteLinkageStrategy())

barrybecker4 commented 5 years ago

I profiled the code and found that 99% of the time was being spent in `data.remove(remove);` in `DistanceMap.remove(ClusterPair link)`. If we simply comment that line out, and don't worry about removing items from the priority queue while the agglomeration is happening, things run much faster. I don't think the remove is that urgent, because the entire DistanceMap will be reclaimed once the agglomeration is done. There are still memory issues for large N, but only for much larger N than without this change. Clustering 2048 nodes took over 16 minutes before this change, but only 10 seconds after.
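The underlying issue is that `java.util.PriorityQueue.remove(Object)` is a linear scan of the heap, so removing pairs one by one during agglomeration degrades badly. A middle ground between eager removal and never removing is "lazy deletion": record removed pairs in a tombstone set and discard stale entries only when they surface at the head of the queue. The sketch below is illustrative only (the class and method names are hypothetical, not this library's API):

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

/**
 * Hypothetical lazy-deletion wrapper around a binary-heap priority queue.
 * markRemoved() is O(1) instead of PriorityQueue.remove()'s O(n) scan;
 * stale entries are skipped when they reach the head during poll().
 */
class LazyMinQueue<T> {
    private final PriorityQueue<T> heap;
    private final Set<T> tombstones = new HashSet<>();

    LazyMinQueue(Comparator<T> cmp) { heap = new PriorityQueue<>(cmp); }

    void add(T item) { heap.add(item); }

    void markRemoved(T item) { tombstones.add(item); }

    /** Returns the smallest live element, or null if none remain. */
    T poll() {
        T head;
        while ((head = heap.poll()) != null && tombstones.remove(head)) {
            // tombstoned entry: skip it and keep polling
        }
        return head;
    }
}
```

Memory for tombstoned entries is reclaimed as they surface, which keeps peak usage bounded by the number of pairs still enqueued rather than growing with every deferred removal.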

[figure: h-cluster-imp — clustering times before and after the change]

Here are some numbers to show the drastic improvement. The numbers also seem to indicate approximately N^2 log N runtime up until memory issues take over, which is what you'd expect from heap operations over O(N^2) pairs.