erke-apoqlar opened 2 years ago
I have considered using my raw data instead of creating the precomputed matrix. However, it doesn't work as expected.
Would you be able to share a bit of info about what didn't work as expected when using the raw data?
The problem is that our precomputed matrices are not Euclidean distance matrices; they are similarity (affinity) matrices.
When I fed the raw data into any of the four clustering algorithms available in cuML, the resulting clustering was completely different from what we had achieved with scikit-learn's Affinity Propagation or Spectral Clustering algorithms.
I did some research and found that scikit-learn's API documentation also states that only Affinity Propagation and Spectral Clustering accept precomputed affinity (similarity) matrices.
So let me repeat my questions:
Thank you very much! @
@erke-apoqlar,
You are absolutely right, HDBSCAN and DBSCAN expect distances and not similarities, as they work in the realm of smallest distances as opposed to largest similarities. In general, it could be reasonable to either flip the similarities into distances or perform an argmax instead of an argmin inside the algorithms, as we do with the max inner product, but that will depend on how the affinity measure was computed.
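For illustration, here is a minimal sketch of the "flip similarities into distances" option, assuming the affinities are normalized to [0, 1] with 1 meaning identical (whether this conversion is meaningful depends on how your affinities were computed; the toy matrix below just stands in for yours):

```python
import numpy as np

# Toy affinity matrix standing in for a precomputed similarity matrix;
# values are assumed to lie in [0, 1] with 1 on the diagonal.
rng = np.random.default_rng(0)
X = rng.random((100, 3))
sim = np.exp(-np.square(X[:, None, :] - X[None, :, :]).sum(axis=-1))

# One common conversion: distance = 1 - similarity.
dist = 1.0 - sim
np.fill_diagonal(dist, 0.0)   # self-distance should be exactly zero
assert (dist >= 0).all()      # sanity check before handing it to a distance-based algorithm
```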
To give us a better idea of the relative scale here, how many data points are you hoping to pre-compute? Usually spectral clustering works best on a graph or sparse affinity matrix so the full n^2 affinities don't need to be stored in memory all at once. Is your affinity matrix dense or sparse?
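As a rough illustration of the sparse route (a scikit-learn sketch with made-up sizes and k, not cuML API): a symmetric kNN connectivity graph keeps only about n·k nonzeros instead of n² dense affinities and can be passed straight to spectral clustering as a precomputed affinity:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
X = rng.random((1000, 8))

# Sparse CSR connectivity graph: ~n * k nonzeros rather than a dense n x n matrix.
knn = kneighbors_graph(X, n_neighbors=15, mode="connectivity", include_self=False)
affinity = 0.5 * (knn + knn.T)   # symmetrize so it is a valid affinity matrix

labels = SpectralClustering(
    n_clusters=4, affinity="precomputed", assign_labels="discretize", random_state=0
).fit_predict(affinity)
```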
DBSCAN does support precomputed inputs but, as you point out, it expects distances and not similarities. HDBSCAN could support precomputed inputs but we have generally avoided this due to the relative scale. As an example, there's an option in both HDBSCAN and SLHC to compute pairwise distances instead of a kNN, but it caps out a 32GB GPU at around 500k data points.
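For reference, the precomputed path for DBSCAN looks roughly like the sketch below; `eps` and `min_samples` are placeholder values, and the dense Euclidean distance matrix just stands in for whatever distances you derive from your affinities:

```python
import numpy as np
from cuml.cluster import DBSCAN

# Toy dense pairwise distance matrix standing in for a converted similarity matrix.
rng = np.random.default_rng(0)
X = rng.random((500, 3)).astype(np.float32)
diff = X[:, None, :] - X[None, :, :]
dist = np.sqrt(np.square(diff).sum(axis=-1))

# With metric="precomputed", DBSCAN consumes the n x n distance matrix directly.
labels = DBSCAN(eps=0.3, min_samples=5, metric="precomputed").fit_predict(dist)
```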
As an aside, we do have a C++ version of spectral clustering that we use as a building block for some of our algorithms. We have a todo to expose this through cuML's Python API, but I can't guarantee when that will be ready. We also have the building blocks in place for affinity propagation but don't currently have it slated for a specific release.
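In the meantime, the precomputed-affinity path you're already using on CPU with scikit-learn looks roughly like this (the toy `sim` matrix stands in for your own similarities):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import SpectralClustering, AffinityPropagation

# Toy data with obvious clusters; `sim` stands in for a precomputed similarity
# matrix (symmetric, larger values meaning "more similar").
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
sim = np.exp(-0.1 * np.square(X[:, None, :] - X[None, :, :]).sum(axis=-1))

spectral_labels = SpectralClustering(
    n_clusters=3, affinity="precomputed", random_state=0
).fit_predict(sim)

# AffinityPropagation also takes precomputed *similarities* (not distances).
ap_labels = AffinityPropagation(affinity="precomputed", random_state=0).fit_predict(sim)
```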
Hi again!
Thank you very much for your prompt answer. To give more background on the topic, we are working on image processing and voxel data.
Let's say we define several properties for each voxel (e.g. brightness, location, color, ...). Then, using our similarity comparison equations, we compare each voxel with every other voxel. Finally, we combine the normalized similarity results with weighted coefficients to build our final similarity matrix.
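To make that concrete, here is a simplified sketch of this kind of weighted combination; the property names, the RBF-style similarity function, and the weights below are placeholders, not our actual equations:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
# Placeholder per-voxel properties.
brightness = rng.random((n, 1))
location = rng.random((n, 3))
color = rng.random((n, 3))

def rbf_similarity(feat, gamma=1.0):
    """Normalized similarity in (0, 1]: 1 on the diagonal, smaller for dissimilar voxels."""
    sq_dist = np.square(feat[:, None, :] - feat[None, :, :]).sum(axis=-1)
    return np.exp(-gamma * sq_dist)

# Assumed weighted coefficients for combining the per-property similarities.
weights = {"brightness": 0.2, "location": 0.5, "color": 0.3}
sim = (weights["brightness"] * rbf_similarity(brightness)
       + weights["location"] * rbf_similarity(location)
       + weights["color"] * rbf_similarity(color))   # final n x n similarity matrix
```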
Answers to your questions:
From our side:
Is your feature request related to a problem? Please describe.
I wish I could use cuML clustering algorithms with affinity matrices. I'm mainly working with precomputed similarity matrices that are built based on our specific relations among the data. Currently only DBSCAN offers precomputed matrix input, and it is not exactly what we need, since it expects a distance matrix rather than a similarity matrix.
Describe the solution you'd like
Support for precomputed affinity (similarity) matrix input in cuML's clustering algorithms, for example Spectral Clustering or Affinity Propagation.
Describe alternatives you've considered
I have considered using my raw data instead of creating the precomputed matrix. However, it doesn't work as expected.