erke-apoqlar opened 2 years ago
I have considered using my raw data instead of creating the precomputed matrix. However, it doesn't work as expected.
Would you be able to share a bit of info about what didn't work as expected when using the raw data?
The problem is that our precomputed matrices are not Euclidean distance matrices; they are similarity (affinity) matrices.
When I fed the raw data into any of the four clustering algorithms available in cuML, the resulting clustering was completely different from what we had achieved with scikit-learn's Affinity Propagation or Spectral Clustering algorithms.
I did some research and found that scikit-learn's API documentation also states that only Affinity Propagation and Spectral Clustering accept precomputed affinity (similarity) matrices.
So let me repeat my questions:
Thank you very much! @
@erke-apoqlar,
You are absolutely right, HDBSCAN and DBSCAN expect distances and not similarities, as they work in the realm of smallest distances as opposed to largest similarities. In general, it could be reasonable to either flip the similarities into distances or perform an argmax instead of an argmin inside the algorithms, as we do with the max inner product, but that will depend on how the affinity measure was computed.
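For illustration, here is a minimal sketch of the "flip similarities into distances" option, assuming the affinities are normalized to [0, 1] with 1 meaning identical (whether this conversion is meaningful depends on how your affinities were computed; the toy matrix below just stands in for yours):

```python
import numpy as np

# Toy affinity matrix standing in for a precomputed similarity matrix;
# values are assumed to lie in [0, 1] with 1 on the diagonal.
rng = np.random.default_rng(0)
X = rng.random((100, 3))
sim = np.exp(-np.square(X[:, None, :] - X[None, :, :]).sum(axis=-1))

# One common conversion: distance = 1 - similarity.
dist = 1.0 - sim
np.fill_diagonal(dist, 0.0)   # self-distance should be exactly zero
assert (dist >= 0).all()      # sanity check before handing it to a distance-based algorithm
```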
To give us a better idea of the relative scale here, how many data points are you hoping to pre-compute? Usually spectral clustering works best on a graph or sparse affinity matrix so the full n^2 affinities don't need to be stored in memory all at once. Is your affinity matrix dense or sparse?
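As a rough illustration of the sparse route (a scikit-learn sketch with made-up sizes and k, not cuML API): a symmetric kNN connectivity graph keeps only about n·k nonzeros instead of n² dense affinities and can be passed straight to spectral clustering as a precomputed affinity:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
X = rng.random((1000, 8))

# Sparse CSR connectivity graph: ~n * k nonzeros rather than a dense n x n matrix.
knn = kneighbors_graph(X, n_neighbors=15, mode="connectivity", include_self=False)
affinity = 0.5 * (knn + knn.T)   # symmetrize so it is a valid affinity matrix

labels = SpectralClustering(
    n_clusters=4, affinity="precomputed", assign_labels="discretize", random_state=0
).fit_predict(affinity)
```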
DBSCAN does support precomputed inputs but, as you point out, it expects distances and not similarities. HDBSCAN could support precomputed inputs but we have generally avoided this due to the relative scale. As an example, there's an option in both HDBSCAN and SLHC to compute pairwise distances instead of a kNN, but it caps out a 32GB GPU at around 500k data points.
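For reference, the precomputed path for DBSCAN looks roughly like the sketch below; `eps` and `min_samples` are placeholder values, and the dense Euclidean distance matrix just stands in for whatever distances you derive from your affinities:

```python
import numpy as np
from cuml.cluster import DBSCAN

# Toy dense pairwise distance matrix standing in for a converted similarity matrix.
rng = np.random.default_rng(0)
X = rng.random((500, 3)).astype(np.float32)
diff = X[:, None, :] - X[None, :, :]
dist = np.sqrt(np.square(diff).sum(axis=-1))

# With metric="precomputed", DBSCAN consumes the n x n distance matrix directly.
labels = DBSCAN(eps=0.3, min_samples=5, metric="precomputed").fit_predict(dist)
```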
As an aside, we do have a C++ version of spectral clustering that we use as a building block for some of our algorithms. We have a todo to expose this through cuML's Python API, but I can't guarantee when that will be ready. We also have the building blocks in place for affinity propagation but don't currently have it slated for a specific release.
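In the meantime, the precomputed-affinity path you're already using on CPU with scikit-learn looks roughly like this (the toy `sim` matrix stands in for your own similarities):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import SpectralClustering, AffinityPropagation

# Toy data with obvious clusters; `sim` stands in for a precomputed similarity
# matrix (symmetric, larger values meaning "more similar").
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
sim = np.exp(-0.1 * np.square(X[:, None, :] - X[None, :, :]).sum(axis=-1))

spectral_labels = SpectralClustering(
    n_clusters=3, affinity="precomputed", random_state=0
).fit_predict(sim)

# AffinityPropagation also takes precomputed *similarities* (not distances).
ap_labels = AffinityPropagation(affinity="precomputed", random_state=0).fit_predict(sim)
```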
Hi again!
Thank you very much for your prompt answer. To give more background on the topic, we are working on image processing and voxel data.
Let's say we define several properties for each voxel (e.g. brightness, location, color, ...). Then, using our similarity comparison equations, we compare each voxel with every other voxel. Finally, we combine the normalized similarity results with weighted coefficients to build our final similarity matrix.
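To make that concrete, here is a simplified sketch of this kind of weighted combination; the property names, the RBF-style similarity function, and the weights below are placeholders, not our actual equations:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
# Placeholder per-voxel properties.
brightness = rng.random((n, 1))
location = rng.random((n, 3))
color = rng.random((n, 3))

def rbf_similarity(feat, gamma=1.0):
    """Normalized similarity in (0, 1]: 1 on the diagonal, smaller for dissimilar voxels."""
    sq_dist = np.square(feat[:, None, :] - feat[None, :, :]).sum(axis=-1)
    return np.exp(-gamma * sq_dist)

# Assumed weighted coefficients for combining the per-property similarities.
weights = {"brightness": 0.2, "location": 0.5, "color": 0.3}
sim = (weights["brightness"] * rbf_similarity(brightness)
       + weights["location"] * rbf_similarity(location)
       + weights["color"] * rbf_similarity(color))   # final n x n similarity matrix
```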
Answers to your questions:
From our side:
Is your feature request related to a problem? Please describe.
I wish I could use cuML clustering algorithms with affinity matrices. I'm mainly working with precomputed similarity matrices that are built based on our specific relations among the data. Currently only DBSCAN offers precomputed matrix input, and it is not exactly what we need, since it expects a distance matrix rather than a similarity matrix.
Describe the solution you'd like
Support for precomputed affinity (similarity) matrix input in cuML's clustering algorithms, for example Spectral Clustering or Affinity Propagation.
Describe alternatives you've considered
I have considered using my raw data instead of creating the precomputed matrix. However, it doesn't work as expected.