rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[FEA] Cosine distance for Kmeans MNMG #5960

Open barius opened 2 months ago

barius commented 2 months ago

Is your feature request related to a problem? Please describe.
I'm fitting a KMeans model on a dataset of 1 billion rows by 1024 dimensions using MNMG KMeans. Cosine distance is preferred because the semantic embedding model I use is typically evaluated with cosine similarity. The current cuML KMeans Python API (single GPU or Dask cluster) does not expose a way to specify the distance metric.

Describe the solution you'd like
Like DBSCAN, expose a metric argument in cuml.dask.cluster.kmeans.KMeans and cuml.cluster.KMeans.

Describe alternatives you've considered
I could preprocess the data by normalizing each feature vector to unit length, but the centroids are not re-normalized during fit(), so the result is still not a true cosine-distance clustering.
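
To illustrate the alternative above, here is a minimal pure-Python sketch. It is not cuML code and not how cuML or RAFT implement KMeans. It checks that for unit-length vectors the squared Euclidean distance equals 2 * (1 - cosine similarity), so normalizing the inputs makes the assignment step behave like cosine distance, and it shows that the mean of unit vectors is generally shorter than unit length, which is why the centroids drift off the unit sphere unless they are re-normalized. The function names (normalize, spherical_kmeans, and so on) are hypothetical helpers introduced only for this example, and the toy loop assumes a small in-memory dataset.

```python
import math
import random


def normalize(v):
    """Scale v to unit L2 norm (a zero vector is returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def spherical_kmeans(points, k, n_iter=20, seed=0):
    """Toy k-means with cosine assignment and re-normalized centroids."""
    rng = random.Random(seed)
    points = [normalize(p) for p in points]
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment: for unit vectors, the largest dot product is the
        # smallest cosine distance.
        labels = [max(range(k), key=lambda j: dot(p, centroids[j]))
                  for p in points]
        # Update: average the members, then project back to unit length.
        # Plain k-means on normalized data would skip this projection.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[j] = normalize(mean)
    return labels, centroids


# Two unit vectors: squared Euclidean distance matches 2 * (1 - cosine).
a = normalize([3.0, 4.0])
b = normalize([1.0, 0.0])
identity_gap = abs(sq_euclidean(a, b) - 2.0 * (1.0 - dot(a, b)))

# Their plain mean is shorter than unit length.
mean_ab = [(x + y) / 2.0 for x, y in zip(a, b)]
mean_norm = math.sqrt(dot(mean_ab, mean_ab))

# Two directional groups with different magnitudes.
data = [[1.0, 0.1], [1.0, -0.1], [2.0, 0.1],
        [0.1, 1.0], [-0.1, 1.0], [0.1, 3.0]]
labels, centroids = spherical_kmeans(data, k=2)
```

Because only the direction of each centroid affects the cosine assignment, the re-normalization mainly matters for what fit() reports as cluster centers and for any inertia computed from them. Whether that is acceptable depends on how the centroids are used downstream.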

Additional context
There is an issue about the C++ API, https://github.com/rapidsai/cuml/issues/563, but I'm not sure how to invoke it.

barius commented 2 months ago

Could I simply monkey patch the line at https://github.com/rapidsai/cuml/blob/f17f89976c197cffc9fe5794d4cf0f846116a0cc/python/cuml/cluster/kmeans.pyx#L204 to change the metric and then recompile the .pyx files?

dantegd commented 1 month ago

Thanks for the request, @barius. Monkey patching will not work with a Cython file like that. Also, I'm not 100% sure whether KMeans on the C++ layer already supports cosine distance. The C++ code lives in https://github.com/rapidsai/raft; @divyegala or @cjnolet would be able to answer about the status of cosine in KMeans.