rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0
4.06k stars 525 forks source link

[BUG] UMAP spectral initialization fails to preserve global structure. #5782

Open kc-howe opened 5 months ago

kc-howe commented 5 months ago

Describe the bug

UMAP spectral initialization yields unexpected initial layout results. Consequently, global structure of the input data often is not preserved, even when a very high n_epochs parameter is used. If UMAP is used as a pre-processing step for clustering, this behavior can impact results significantly, depending on the geometry of the input data.

Steps/Code to reproduce bug

The code below provides a minimal example which exhibits the issue on scikit-learn's make_circles dataset. Since the results of the cuML UMAP implementation may vary despite setting a random_state seed, one may have to repeatedly run the code below in order to get a result which demonstrates the problematic behavior.

import cudf
import cupy
import plotly.express as px
from cuml import UMAP as GPU_UMAP
from plotly.subplots import make_subplots
from sklearn.datasets import make_circles
from umap import UMAP as CPU_UMAP

IMAGES_DIRECTORY = './images'

# Create datasets
X_circles, y_circles = make_circles(1000, random_state=42)

# CPU embedding
cpu_mapper = CPU_UMAP(
    n_neighbors=15,
    n_components=2,
    n_epochs=2000,
    init='spectral',
    learning_rate=1.0,
    random_state=0
)
cpu_embedding = cpu_mapper.fit_transform(X_circles)
cpu_embedding_df = cudf.DataFrame(cpu_embedding, columns=['x','y'])
cpu_embedding_df['label'] = y_circles.astype(str)

# GPU embedding
gpu_mapper = GPU_UMAP(
    n_neighbors=15,
    n_components=2,
    n_epochs=2000,
    init='spectral',
    learning_rate=1.0,
    random_state=0
)
gpu_embedding = gpu_mapper.fit_transform(cupy.asarray(X_circles))
gpu_embedding_df = cudf.DataFrame(gpu_embedding, columns=['x','y'])
gpu_embedding_df['label'] = y_circles.astype(str)

# Make plots
cpu_plot = px.scatter(
    cpu_embedding_df,
    x='x',
    y='y',
    color='label',
    title='Circles UMAP Embedding (CPU)'
)

gpu_plot = px.scatter(
    gpu_embedding_df,
    x='x',
    y='y',
    color='label',
    title='Circles UMAP Embedding (GPU)'
)

fig = make_subplots(1, 2, subplot_titles=['CPU Embedding', 'GPU Embedding'])
for trace in cpu_plot['data']:
    fig.append_trace(trace, row=1, col=1)
for trace in gpu_plot['data']:
    fig.append_trace(trace, row=1, col=2)

fig.update_layout(
    title_text='UMAP Spectral Init Global Structure Comparison',
    width=900,
    height=500,
    showlegend=False
)

fig.write_image(f'{IMAGES_DIRECTORY}/circles_embedding.png')

The resulting plot should look something like this:

circles_embedding

Expected behavior

Spectral initialization should yield similar global structure as the CPU version of UMAP in the initial layout and consequently after layout optimization. For the most part at least, spatially distinct connected components of the input data should stay separated after embedding; UMAP with spectral initialization should not make clustering such cases more difficult (see rings example below).

The results of CPU and GPU UMAP should not be identical, of course, as there are understandably some implementation differences, particularly in regards to spectral embedding. However, general behavior with respect to global structure preservation under spectral initialization should be the same.

The following figure demonstrates the extreme difference in spectral initialization behavior between CPU and GPU UMAP on scikit-learn's make_blobs. Note that it seems impossible to set n_epochs to precisely 0 in the cuML implementation without invoking default values, so a minimal value of 1 is used below. (Apologies for the small text, CPU results are in the top row, GPU results are in the bottom row.)

umap_embedding_cpu_gpu_blobs_epochs1_init=spectral

For a dataset like make_blobs, the initialization is very different, but the CPU and GPU results can generally be made the same by running at a high number of epochs. Here is a similar figure demonstrating the difference on a dataset comprised of three non-concentric rings, which cannot be improved by setting high n_epochs (see further below):

three_rings_dataset

umap_embedding_cpu_gpu_equidistant_rings_epochs=1_init=spectral

I do not suspect that the difference in spectral initialization results can always be resolved by raising the n_epochs parameter, as has been suggested as a potential resolution to similar issues reported by other users (e.g. #5474). Since the low-dimensional layout optimization only acts on KNN-localized edge weights, I don't expect that any number of epochs would promise the recovery of global structure. To be sure, we can check that at n_epochs values of 500 and 2000 we observe the same difference in behavior between CPU and GPU UMAP for the three rings dataset above.

umap_embedding_cpu_gpu_equidistant_rings_epochs=500_init=spectral

umap_embedding_cpu_gpu_equidistant_rings_epochs=2000_init=spectral

Environment details (please complete the following information):

Additional context

I initially felt this could be related to the known issue with the Laplacian eigenmaps solver mentioned in the comments of #5474, however the differences in results compared to the CPU solvers seem somewhat extreme.

I am also aware that CPU UMAP handles spectral layout of networks with multiple components somewhat differently than single-component networks. However, datasets which yield single-component graphs may exhibit the same behavior as above, e.g. when embedding a single ring.

I am happy to provide any additional code, examples, or environment details upon request.

Additionally, thank you all for the incredible work you do on this repository, and in particular for bringing UMAP to the GPU. You guys are phenomenal, and your efforts here are so deeply appreciated!

dantegd commented 4 months ago

Thanks for the issue @kc-howe! We have identified a few ill-behaviors and issues with spectral clustering from RAFT that affect UMAP in particular. We will be working on solving them, but we don't have an ETA yet, but is in our roadmap as we work on RAFT, cuML and cuVS in the next few releases.

beckernick commented 1 month ago

For anyone who comes across on this issue, please see discussion in https://github.com/rapidsai/cuml/issues/5707#issuecomment-2115156120

Hey everyone,

Sorry for being late to this discussion. I think one of the problems here might be the assumption that cuML's UMAP will always yield the exact same results as the CPU-based reference implementation for the exact same parameter settings. We did some parallelism magic on the GPU side to speed up the algorithm and as a result, it's possible that some of the parameters (such as the number of iterations for the solver) might need to be tweaked a bit.

In addition, the underlying GPU-accelerated spectral embedding initialization primitive has gotten fairly old by this point and hasn't been updated in quite some time so it's been accumulating little bugs as CUDA versions increase and the code itself becomes more stale. I suggest trying to use the random initialization, along with adjusting the number of neighbors and the number of iterations to see if that improves the quality of your embeddings.

We have an engineer ramping up to fix the spectral clustering initialization and they will also be working to improve the end to end quality of the results. Again, I apology sincerely for the delay in replying to this thread.