Open chentitus opened 1 month ago
The UMAP docstring indicates that random_state can't provide exact determinism but should provide consistency up to about 3 digits of precision. The repeated runs below don't agree even to that level.
@dantegd, is it possible we have a bug, or is the documentation wrong?
import cuml
from sklearn.datasets import make_blobs

N = 1000
X, y = make_blobs(n_samples=N)

NREP = 3
for i in range(NREP):
    # Same seed on every iteration, default init
    reducer = cuml.manifold.umap.UMAP(random_state=12)
    X_t = reducer.fit_transform(X)
    print(reducer.random_state)
    print(X_t[:5])
    print()
662124363
[[-2.5505848 -0.63661003]
[-5.3669243 -0.07881355]
[-4.428316 1.4433041 ]
[-0.9989338 10.929661 ]
[ 6.8667793 -9.262173 ]]
662124363
[[ -1.9667425 -2.6903896 ]
[ -3.396501 -0.25006104]
[ -1.6785622 0.13145828]
[ 3.3643045 11.314904 ]
[ -2.0715647 -11.898888 ]]
662124363
[[ 0.3823166 2.5653324 ]
[ 0.5335636 -0.0426445 ]
[ 2.2950068 0.81112003]
[ -7.4286957 10.400803 ]
[ 8.3242235 -10.5068655 ]]
Dear cuml team,
Another cuml-related issue has just popped up:
I need the topic distribution of each document, so I followed the BERTopic instructions for approximate_distribution, but it returns an ndarray containing nothing but 0s.
I have just realized that this issue may be due to cuml.
approximate_distribution produces a topic distribution if I use
from umap import UMAP
from hdbscan import HDBSCAN
But approximate_distribution returns only 0s if I use
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP
Any help or advice is much appreciated!
@beckernick I am not quite sure it works with spectral initialization. Could you try using init="random"?
That looks like a bug to me. Oddly, we also have Python tests for reproducibility, and those appear to be passing...
Victor's got a good point: it's very possible the spectral embedding is not honoring the random state, and that's why we are using random init in the pytests.
Looks like that's the bug:
import cuml
from sklearn.datasets import make_blobs

N = 1000
X, y = make_blobs(n_samples=N)

NREP = 3
for i in range(NREP):
    # Same seed as before, but with random instead of spectral init
    reducer = cuml.manifold.umap.UMAP(
        random_state=12,
        init="random",
    )
    X_t = reducer.fit_transform(X)
    print(reducer.random_state)
    print(X_t[:5])
    print()
662124363
[[ -4.766629 8.464443 ]
[ 8.891461 1.2006083]
[ -7.211566 -7.8680773]
[ -5.811491 -12.208349 ]
[ -6.8120937 7.2288113]]
662124363
[[ -4.766629 8.464443 ]
[ 8.891461 1.2006083]
[ -7.211566 -7.8680773]
[ -5.811491 -12.208349 ]
[ -6.8120937 7.2288113]]
662124363
[[ -4.766629 8.464443 ]
[ 8.891461 1.2006083]
[ -7.211566 -7.8680773]
[ -5.811491 -12.208349 ]
[ -6.8120937 7.2288113]]
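For anyone who wants to check this without comparing printed arrays by eye, here is a minimal sketch of an automated version of the same repeated-fit comparison. The helper name `embeddings_match` and its parameters are hypothetical, not part of cuml. Reading "about 3 digits of precision" as an absolute tolerance of 1e-3 is an assumption, as is the embedding being convertible with `np.asarray` (which should hold when the input is a NumPy array, as in the snippets above).

```python
import numpy as np


def embeddings_match(make_reducer, X, n_rep=3, decimals=3):
    """Fit a fresh reducer n_rep times and compare every embedding to the first.

    make_reducer: zero-argument callable returning an object with fit_transform(X).
    decimals:     agreement is tested with an absolute tolerance of 10**-decimals
                  (an assumed reading of "about 3 digits of precision").
    Returns True only if all runs agree with the first run within that tolerance.
    """
    reference = None
    for _ in range(n_rep):
        # Assumes the embedding can be converted to a NumPy array.
        emb = np.asarray(make_reducer().fit_transform(X))
        if reference is None:
            reference = emb
        elif emb.shape != reference.shape or not np.allclose(
            emb, reference, atol=10.0 ** -decimals, rtol=0.0
        ):
            return False
    return True
```

With the data from the snippet above, this could be called as `embeddings_match(lambda: cuml.manifold.umap.UMAP(random_state=12, init="random"), X)`. Going by the outputs in this thread, I would expect True with init="random" and False with the default init.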
Are we already planning to fix spectral initialization, or should I open a PR to update the documentation regarding this limitation for now?
cc @cjnolet @dantegd
Dear cuml team,
I am utilizing BERTopic for topic modeling. I understand that when I import UMAP from umap, and HDBSCAN from hdbscan, I can reproduce the results of topic modeling by setting random_state in UMAP.
But I realized that if I import HDBSCAN from cuml.cluster, and UMAP from cuml.manifold, then the results of topic modeling can no longer be replicated even when I set random_state in UMAP.
This was run on the Colab platform, with BERTopic upgraded to 0.16.2.
Any ideas on how I can reproduce topic modeling results using cuml? Thanks much!