rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] metric parameter in cuml.UMAP not used? #5422

Open julenmendieta opened 1 year ago

julenmendieta commented 1 year ago

Describe the bug

When I set `metric` in cuml.UMAP(), the output suggests that the same metric (possibly Euclidean) is always used, regardless of the one I specify.

Steps/Code to reproduce bug

import sklearn.datasets
from cuml.neighbors import NearestNeighbors
from cuml.manifold import UMAP
import cudf
import cuml
import kmapper as km
import gc
from matplotlib import pyplot as plt
import seaborn as sns  # needed for the sns.scatterplot calls below

# Get some data
df_use, labels = sklearn.datasets.fetch_openml(
    'mnist_784', version=1, return_X_y=True, parser='auto'
)
n_neighbors=min(max([int(len(df_use.index)**(1/3)), 15]), len(df_use.index))

# STEP 1
# Precompute the kNN graph with NearestNeighbors and pass it to UMAP.
# Here the requested metric is respected.
m = NearestNeighbors(n_neighbors=n_neighbors, metric='chebyshev')
m.fit(df_use)
knn_graph = m.kneighbors_graph(df_use, n_neighbors=n_neighbors, mode='distance')
u = UMAP(n_neighbors=n_neighbors, n_components=2, random_state=42, 
         min_dist = 0.01, init = 'spectral', n_epochs = 2000)
standard_embedding = u.fit_transform(df_use, knn_graph=knn_graph)
fig, ax = plt.subplots(1,1, figsize=(10,10))
sns.scatterplot(data={'X':standard_embedding[0], 
                'Y':standard_embedding[1]}, 
                x='X', y='Y', 
               s=0.3, alpha=0.5)
fig.suptitle("chebyshev Knn")
plt.show()

# STEP 2
# Pass the metric directly to cuml.UMAP.
# Here the requested metric appears to be ignored.
reducer = cuml.UMAP(n_neighbors=n_neighbors,
                n_components=2,
                metric='chebyshev',
                low_memory=True,
               random_state=42,
                   min_dist = 0.01,
                   init = 'spectral',
                   n_epochs = 2000, verbose = False,
)
standard_embedding = reducer.fit_transform(df_use)
gc.collect()
# show UMAP
fig, ax = plt.subplots(1,1, figsize=(10,10))
sns.scatterplot(data={'X':standard_embedding[0], 
                'Y':standard_embedding[1]}, 
                x='X', y='Y', 
               s=0.3, alpha=0.5)
fig.suptitle("chebyshev UMAP")
plt.show()

# STEP 3
# Use Euclidean on STEP 1 method to check similarity with STEP 2
m = NearestNeighbors(n_neighbors=n_neighbors, metric='euclidean')
m.fit(df_use)
knn_graph = m.kneighbors_graph(df_use, n_neighbors=n_neighbors, mode='distance')
u = UMAP(n_neighbors=n_neighbors, n_components=2, random_state=42, 
         min_dist = 0.01, init = 'spectral', n_epochs = 2000)
standard_embedding = u.fit_transform(df_use, knn_graph=knn_graph)
fig, ax = plt.subplots(1,1, figsize=(10,10))
sns.scatterplot(data={'X':standard_embedding[0], 
                'Y':standard_embedding[1]}, 
                x='X', y='Y', 
               s=0.3, alpha=0.5)
fig.suptitle("Euclidean KNN")
plt.show()

# STEP 4
# Use Euclidean on STEP 2 method to show that UMAP does not change
reducer = cuml.UMAP(n_neighbors=n_neighbors,
                n_components=2,
                metric='euclidean',
                low_memory=True,
               random_state=42,
                   min_dist = 0.01,
                   init = 'spectral',
                   n_epochs = 2000, verbose = False,
)
standard_embedding = reducer.fit_transform(df_use)
gc.collect()
# show UMAP
fig, ax = plt.subplots(1,1, figsize=(10,10))
sns.scatterplot(data={'X':standard_embedding[0], 
                'Y':standard_embedding[1]}, 
                x='X', y='Y', 
               s=0.3, alpha=0.5)
fig.suptitle("Euclidean UMAP")
plt.show()
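
For reference, the `n_neighbors` expression near the top of the script takes the truncated cube root of the number of samples, uses at least 15, and never exceeds the sample count. The sketch below only restates that expression as a standalone function; the name `choose_n_neighbors` and its `floor` parameter are hypothetical and not part of cuML.

```python
def choose_n_neighbors(n_samples, floor=15):
    """Cube-root heuristic from the script: at least `floor`, at most n_samples."""
    cube_root = int(n_samples ** (1 / 3))  # truncated, not rounded
    return min(max(cube_root, floor), n_samples)


# mnist_784 has 70,000 rows, so the script ends up with 41 neighbors.
print(choose_n_neighbors(70_000))
```

For very small datasets the cap takes over, so the value can never exceed the number of rows.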

Expected behavior

I would expect STEP 1 and STEP 2 to produce similar embeddings, since both request the Chebyshev metric.
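
The reason the plots should differ between metrics is that the kNN graph itself changes with the metric. The sketch below is not from the original report: it uses synthetic data and a brute-force NumPy search instead of cuML, and the helper names `knn_indices` and `mean_neighbor_overlap` are hypothetical. It measures how many neighbors the Euclidean and Chebyshev graphs share on the same points.

```python
import numpy as np


def knn_indices(X, k, metric):
    """Brute-force k nearest neighbors (excluding self) under the given metric."""
    diff = np.abs(X[:, None, :] - X[None, :, :])  # pairwise |x_i - x_j|, shape (n, n, d)
    if metric == "euclidean":
        dist = np.sqrt((diff ** 2).sum(axis=-1))
    elif metric == "chebyshev":
        dist = diff.max(axis=-1)
    else:
        raise ValueError(f"unsupported metric: {metric}")
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]


def mean_neighbor_overlap(idx_a, idx_b):
    """Average fraction of neighbors shared between two kNN index arrays."""
    k = idx_a.shape[1]
    shared = [len(set(a) & set(b)) for a, b in zip(idx_a, idx_b)]
    return float(np.mean(shared)) / k


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))  # synthetic stand-in for the real data
k = 15

idx_euc = knn_indices(X, k, "euclidean")
idx_che = knn_indices(X, k, "chebyshev")
overlap = mean_neighbor_overlap(idx_euc, idx_che)
print(f"mean neighbor overlap (euclidean vs chebyshev): {overlap:.2f}")
```

An overlap well below 1 means the two metrics select different neighbors, so an embedding that looks identical for every `metric` value is a hint that the parameter is not reaching the neighbor search.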

Environment details (please complete the following information):

Additional context

cuml version: 23.04.01

I tried to look for similar issues but couldn't find them; apologies if I missed any

viclafargue commented 1 year ago

Thank you for noticing this issue. There was indeed a problem with how the metric is used for dense inputs inside the UMAP implementation. I just opened a fix for it.

julenmendieta commented 1 year ago

Thanks to you for such a great package :)