Open barius opened 2 months ago
Could I simply monkey patch https://github.com/rapidsai/cuml/blob/f17f89976c197cffc9fe5794d4cf0f846116a0cc/python/cuml/cluster/kmeans.pyx#L204 to modify the metric and then recompile the .pyx files?
Thanks for the request @barius. Monkey patching will not work with a Cython file like that. Also, I'm not 100% sure whether kmeans in the C++ layer already supports it. The C++ code lives in https://github.com/rapidsai/raft; @divyegala or @cjnolet would be able to answer about the status of cosine in kmeans.
Is your feature request related to a problem? Please describe.
I'm fitting a KMeans model on a 1 billion × 1024-dimension dataset using MNMG KMeans. Cosine distance is preferred because the semantic embedding model I use typically uses cosine. The current cuML KMeans Python API (single CPU or Dask cluster) does not expose a way to specify the distance metric.
Describe the solution you'd like
Like DBSCAN, expose a `metrics` argument in `cuml.dask.cluster.kmeans.KMeans` and `cuml.cluster.KMeans`.
Describe alternatives you've considered
I could preprocess the data to normalize the features, but the centroids are not normalized during `fit()`.
Additional context
There is an issue about the C++ API: https://github.com/rapidsai/cuml/issues/563 but I'm not sure how to invoke it.
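
To illustrate why normalizing only the inputs is not enough: for unit vectors `a` and `b`, `||a - b||^2 = 2 - 2*cos(a, b)`, so Euclidean and cosine nearest-centroid assignments agree only when the centroids are unit length too. Below is a minimal CPU-only NumPy sketch of that idea (often called spherical k-means), which re-normalizes the centroids after every update. It is not cuML or RAFT code; `l2_normalize` and `spherical_kmeans` are hypothetical names used only for illustration, and the random initialization is an assumption.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit Euclidean length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def spherical_kmeans(x, n_clusters, n_iter=20, seed=0):
    """Hypothetical helper: k-means on unit vectors, centroids re-normalized each step."""
    rng = np.random.default_rng(seed)
    x = l2_normalize(x)
    # Assumed initialization: pick distinct random points as starting centroids.
    centroids = x[rng.choice(len(x), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # For unit vectors, the largest dot product is the smallest cosine distance.
        labels = np.argmax(x @ centroids.T, axis=1)
        for k in range(n_clusters):
            members = x[labels == k]
            if len(members):  # keep the previous centroid if a cluster is empty
                centroids[k] = l2_normalize(members.sum(axis=0, keepdims=True))[0]
    # Final assignment against the last set of centroids.
    labels = np.argmax(x @ centroids.T, axis=1)
    return centroids, labels
```

This only shows the per-iteration normalization step; it would not scale to a 1 billion row dataset and says nothing about how a metric option would be wired through the cuML API.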