rapidsai / cuml

cuML - RAPIDS Machine Learning Library
Apache License 2.0

[BUG] UMAP random_state doesn't provide consistency #5892

Open chentitus opened 1 month ago

chentitus commented 1 month ago

Dear cuml team,

I am utilizing BERTopic for topic modeling. I understand that when I import UMAP from umap, and HDBSCAN from hdbscan, I can reproduce the results of topic modeling by setting random_state in UMAP.

But I realized that if I import HDBSCAN from cuml.cluster, and UMAP from cuml.manifold, then the results of topic modeling can no longer be replicated even when I set random_state in UMAP.

This was done on the Colab platform, after upgrading BERTopic to 0.16.2.

Any ideas on how I can reproduce topic modeling results using cuml? Thanks much!

beckernick commented 1 month ago

The UMAP docstring indicates that random_state can't provide exact determinism but should provide consistency up to about 3 digits of precision.

@dantegd, is it possible we have a bug, or is the documentation wrong?

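import cuml
from sklearn.datasets import make_blobs

N = 1000

# The arguments to make_blobs and UMAP were cut off in this post;
# the values below are placeholders, not the original ones.
X, y = make_blobs(n_samples=N)

NREP = 3
for i in range(NREP):
    reducer = cuml.manifold.umap.UMAP(random_state=42)
    X_t = reducer.fit_transform(X)
    print(X_t[:5])
    print()
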
[[-2.5505848  -0.63661003]
 [-5.3669243  -0.07881355]
 [-4.428316    1.4433041 ]
 [-0.9989338  10.929661  ]
 [ 6.8667793  -9.262173  ]]

[[ -1.9667425   -2.6903896 ]
 [ -3.396501    -0.25006104]
 [ -1.6785622    0.13145828]
 [  3.3643045   11.314904  ]
 [ -2.0715647  -11.898888  ]]

[[  0.3823166    2.5653324 ]
 [  0.5335636   -0.0426445 ]
 [  2.2950068    0.81112003]
 [ -7.4286957   10.400803  ]
 [  8.3242235  -10.5068655 ]]

chentitus commented 1 month ago

Dear cuml team,

Another cuml-related issue has just popped up:

I need the topic distribution of each document, so I followed the BERTopic instructions for approximate_distribution, but it returns an ndarray containing nothing but 0s.

I have just realized that this issue may be due to cuml.

approximate_distribution generates a topic distribution if I use

from umap import UMAP
from hdbscan import HDBSCAN

But approximate_distribution returns only 0s if I use

from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

Any help or advice is much appreciated!

viclafargue commented 1 month ago

@beckernick I am not sure random_state works with spectral initialization. Could you try using init="random"?

cjnolet commented 1 month ago

That looks like a bug to me. Oddly, we also have Python tests for reproducibility, and those appear to be passing...

Victor's got a good point: it's very possible the spectral embedding is not honoring the random state, and that's why we are using random init in the pytests.

beckernick commented 1 month ago

Looks like that's the bug:

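import cuml
from sklearn.datasets import make_blobs

N = 1000

# The arguments to make_blobs and UMAP were cut off in this post;
# the values below are placeholders, except init="random", which is
# the setting suggested earlier in the thread.
X, y = make_blobs(n_samples=N)

NREP = 3
for i in range(NREP):
    reducer = cuml.manifold.umap.UMAP(init="random", random_state=42)
    X_t = reducer.fit_transform(X)
    print(X_t[:5])
    print()
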
[[ -4.766629    8.464443 ]
 [  8.891461    1.2006083]
 [ -7.211566   -7.8680773]
 [ -5.811491  -12.208349 ]
 [ -6.8120937   7.2288113]]

[[ -4.766629    8.464443 ]
 [  8.891461    1.2006083]
 [ -7.211566   -7.8680773]
 [ -5.811491  -12.208349 ]
 [ -6.8120937   7.2288113]]

[[ -4.766629    8.464443 ]
 [  8.891461    1.2006083]
 [ -7.211566   -7.8680773]
 [ -5.811491  -12.208349 ]
 [ -6.8120937   7.2288113]]
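
The docstring's "about 3 digits of precision" can be checked against the rows printed in this thread. The following is a minimal sketch, not part of cuML: the helper names and the `tol=1e-3` threshold are assumptions chosen to illustrate one reading of "3 digits", and only the first two printed runs of each snippet are compared.

```python
# Rows copied from the outputs printed earlier in this thread.
# First snippet (default init, presumably spectral), runs 1 and 2:
DEFAULT_INIT_RUN_1 = [
    [-2.5505848, -0.63661003],
    [-5.3669243, -0.07881355],
    [-4.428316, 1.4433041],
    [-0.9989338, 10.929661],
    [6.8667793, -9.262173],
]
DEFAULT_INIT_RUN_2 = [
    [-1.9667425, -2.6903896],
    [-3.396501, -0.25006104],
    [-1.6785622, 0.13145828],
    [3.3643045, 11.314904],
    [-2.0715647, -11.898888],
]
# Second snippet (random init): every run printed the same rows.
RANDOM_INIT_RUN_1 = [
    [-4.766629, 8.464443],
    [8.891461, 1.2006083],
    [-7.211566, -7.8680773],
    [-5.811491, -12.208349],
    [-6.8120937, 7.2288113],
]
RANDOM_INIT_RUN_2 = [row[:] for row in RANDOM_INIT_RUN_1]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two 2-D lists."""
    return max(
        abs(x - y)
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )


def consistent(a, b, tol=1e-3):
    """True if every coordinate agrees within tol (assumed threshold)."""
    return max_abs_diff(a, b) <= tol


print(consistent(DEFAULT_INIT_RUN_1, DEFAULT_INIT_RUN_2))  # False
print(consistent(RANDOM_INIT_RUN_1, RANDOM_INIT_RUN_2))    # True
```

With real embeddings, the same check could be applied to the arrays returned by `fit_transform` after converting them to nested lists.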

viclafargue commented 2 weeks ago

Are we planning to fix spectral initialization soon, or should I open a PR to document this limitation for now?

cc @cjnolet @dantegd