Describe the bug
When I set the metric in cuml.UMAP(), the output suggests that it always uses the same metric (possibly Euclidean), regardless of the one I specify.
Steps/Code to reproduce bug
import sklearn.datasets
from cuml.neighbors import NearestNeighbors
from cuml.manifold import UMAP
import cudf
import cuml
import kmapper as km
import gc
from matplotlib import pyplot as plt
import seaborn as sns
# Get some data
df_use, labels = sklearn.datasets.fetch_openml(
    'mnist_784', version=1, return_X_y=True, parser='auto'
)
n_neighbors = min(max([int(len(df_use.index) ** (1 / 3)), 15]), len(df_use.index))
# STEP 1
# Here the requested metric is respected: the KNN graph is precomputed and passed to UMAP
m = NearestNeighbors(n_neighbors=n_neighbors, metric='chebyshev')
m.fit(df_use)
knn_graph = m.kneighbors_graph(df_use, n_neighbors=n_neighbors, mode='distance')
u = UMAP(n_neighbors=n_neighbors, n_components=2, random_state=42,
         min_dist=0.01, init='spectral', n_epochs=2000)
standard_embedding = u.fit_transform(df_use, knn_graph=knn_graph)
fig, ax = plt.subplots(1,1, figsize=(10,10))
sns.scatterplot(data={'X': standard_embedding[0],
                      'Y': standard_embedding[1]},
                x='X', y='Y',
                s=0.3, alpha=0.5)
fig.suptitle("chebyshev Knn")
plt.show()
# STEP 2
# Here the requested metric appears to be ignored: the metric is passed directly to UMAP
reducer = cuml.UMAP(
    n_neighbors=n_neighbors,
    n_components=2,
    metric='chebyshev',
    low_memory=True,
    random_state=42,
    min_dist=0.01,
    init='spectral',
    n_epochs=2000,
    verbose=False,
)
standard_embedding = reducer.fit_transform(df_use)
gc.collect()
# show UMAP
fig, ax = plt.subplots(1,1, figsize=(10,10))
sns.scatterplot(data={'X': standard_embedding[0],
                      'Y': standard_embedding[1]},
                x='X', y='Y',
                s=0.3, alpha=0.5)
fig.suptitle("chebyshev UMAP")
plt.show()
# STEP 3
# Use Euclidean on STEP 1 method to check similarity with STEP 2
m = NearestNeighbors(n_neighbors=n_neighbors, metric='euclidean')
m.fit(df_use)
knn_graph = m.kneighbors_graph(df_use, n_neighbors=n_neighbors, mode='distance')
u = UMAP(n_neighbors=n_neighbors, n_components=2, random_state=42,
         min_dist=0.01, init='spectral', n_epochs=2000)
standard_embedding = u.fit_transform(df_use, knn_graph=knn_graph)
fig, ax = plt.subplots(1,1, figsize=(10,10))
sns.scatterplot(data={'X': standard_embedding[0],
                      'Y': standard_embedding[1]},
                x='X', y='Y',
                s=0.3, alpha=0.5)
fig.suptitle("Euclidean KNN")
plt.show()
# STEP 4
# Use Euclidean on STEP 2 method to show that UMAP does not change
reducer = cuml.UMAP(
    n_neighbors=n_neighbors,
    n_components=2,
    metric='euclidean',
    low_memory=True,
    random_state=42,
    min_dist=0.01,
    init='spectral',
    n_epochs=2000,
    verbose=False,
)
standard_embedding = reducer.fit_transform(df_use)
gc.collect()
# show UMAP
fig, ax = plt.subplots(1,1, figsize=(10,10))
sns.scatterplot(data={'X': standard_embedding[0],
                      'Y': standard_embedding[1]},
                x='X', y='Y',
                s=0.3, alpha=0.5)
fig.suptitle("Euclidean UMAP")
plt.show()
Expected behavior
I would expect a similar UMAP when using the code in STEP 1 and STEP 2
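As a rough sanity check that the two metrics really should lead to different results, the neighbour sets they produce can be compared directly on the CPU. The following is only a minimal sketch using NumPy: the helper names (pairwise_distances, knn_indices, neighbor_overlap) are made up for illustration and are not part of cuml, and the random array is a synthetic stand-in for a small subsample of df_use.

```python
import numpy as np


def pairwise_distances(X, metric):
    """Dense pairwise distance matrix for a small sample X of shape (n, d)."""
    diff = X[:, None, :] - X[None, :, :]
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=-1))
    if metric == "chebyshev":
        return np.abs(diff).max(axis=-1)
    raise ValueError(f"unsupported metric: {metric}")


def knn_indices(X, k, metric):
    """Indices of the k nearest neighbours of every row, excluding the row itself."""
    dist = pairwise_distances(X, metric)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    return np.argsort(dist, axis=1, kind="stable")[:, :k]


def neighbor_overlap(idx_a, idx_b):
    """Mean fraction of neighbours shared between two (n, k) index arrays."""
    shared = [len(set(a) & set(b)) for a, b in zip(idx_a, idx_b)]
    return float(np.mean(shared)) / idx_a.shape[1]


# Synthetic data (assumption): replace with a small subsample of df_use.
rng = np.random.default_rng(0)
sample = rng.random((200, 50))

idx_euclidean = knn_indices(sample, k=15, metric="euclidean")
idx_chebyshev = knn_indices(sample, k=15, metric="chebyshev")
overlap = neighbor_overlap(idx_euclidean, idx_chebyshev)
print(f"Shared neighbours (euclidean vs chebyshev): {overlap:.2f}")
```

If the overlap is well below 1, the two metrics define noticeably different neighbourhoods, so identical-looking embeddings for both metrics would be suspicious. The dense distance matrix only scales to a few thousand rows, so this is meant for a subsample, not the full dataset.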
Environment details (please complete the following information):
Environment location:
Linux Distro/Architecture: Ubuntu 20.04.3 LTS x86_64
Thank you for reporting this issue. There was indeed a problem with how the metric was used in the dense case inside the UMAP implementation. I have just opened a fix for it.
Additional context
cuml version: 23.04.01
I tried to look for similar issues but couldn't find them; apologies if I missed any.