rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] UMAP transform accuracy. #3864

Open trivialfis opened 3 years ago

trivialfis commented 3 years ago

I compared the results of CPU UMAP and GPU UMAP on the Fashion-MNIST dataset, and the CPU implementation appears more accurate from a visualization standpoint. The comparison is between branch-0.20 of cuml and version 0.5.1 of CPU UMAP, both run with a fixed seed:

[Images: CPU (transform-cpu) | GPU (transform-gpu)]
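The comparison above is visual. As a rough numeric complement, one could score how often each point's nearest neighbours in the 2-D embedding share its class label. The sketch below is only an illustration: the function name `knn_label_purity` and the choice of k are assumptions, not part of cuML or umap-learn, and the pairwise distance matrix needs O(n^2) memory, so a large embedding would have to be subsampled first.

```python
import numpy as np


def knn_label_purity(embedding, labels, k=10):
    """Mean fraction of each point's k nearest neighbours that share its label.

    Hypothetical helper for illustration; requires k < number of points.
    """
    embedding = np.asarray(embedding, dtype=np.float64)
    labels = np.asarray(labels)

    # Pairwise squared Euclidean distances between all embedded points.
    sq_norms = np.sum(embedding ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * embedding @ embedding.T
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour

    # Indices of the k closest points for every row (unordered within the k).
    neighbours = np.argpartition(d2, k - 1, axis=1)[:, :k]

    # Compare each neighbour's label with the label of the query point.
    return float(np.mean(labels[neighbours] == labels[:, None]))


# Toy data: two tight, well-separated clusters.
points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
clean_labels = np.array([0, 0, 0, 1, 1, 1])
mixed_labels = np.array([0, 1, 0, 1, 0, 1])

purity_clean = knn_label_purity(points, clean_labels, k=2)
purity_mixed = knn_label_purity(points, mixed_labels, k=2)
print(purity_clean, purity_mixed)
```

Computing this score for the CPU and GPU embeddings of the same labelled subsample would give a single number per implementation to compare alongside the plots.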

Sample code:

import os
import gzip
import numpy as np
import cuml
from bokeh.plotting import figure, output_file, show
from bokeh.models import CategoricalColorMapper, ColumnDataSource
from bokeh.palettes import Category10
import umap

def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    labels_path = os.path.join(path,
                               '%s-labels-idx1-ubyte.gz'
                               % kind)
    images_path = os.path.join(path,
                               '%s-images-idx3-ubyte.gz'
                               % kind)

    with gzip.open(labels_path, 'rb') as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8,
                               offset=8)

    with gzip.open(images_path, 'rb') as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8,
                               offset=16).reshape(len(labels), 784)

    return images, labels

X, y = load_mnist("fashion-mnist/data/fashion")

n_epochs = None  # None lets each library choose its default number of epochs
model = cuml.manifold.UMAP(n_epochs=n_epochs, random_state=1994)
# Use CPU or GPU
# model = umap.UMAP(n_epochs=n_epochs, random_state=1994)

embedding = model.fit_transform(X)
output_file("fashion.html")

def plot_fit():
    targets = [str(d) for d in range(10)]

    source = ColumnDataSource(
        dict(
            x=[e[0] for e in embedding],
            y=[e[1] for e in embedding],
            label=[targets[d] for d in y],
        )
    )

    cmap = CategoricalColorMapper(factors=targets, palette=Category10[10])

    p = figure(title="test umap")
    p.circle(
        x="x",
        y="y",
        source=source,
        color={"field": "label", "transform": cmap},
        legend_field="label",
    )

    show(p)

def plot_transform(model):
    n = X.shape[0] // 2
    transformed = model.transform(X[:n])
    targets = [str(d) for d in range(10)]
    labels = y[:n]

    source = ColumnDataSource(
        dict(
            x=[e[0] for e in transformed],
            y=[e[1] for e in transformed],
            label=[targets[d] for d in labels],
        )
    )

    cmap = CategoricalColorMapper(factors=targets, palette=Category10[10])

    p = figure(title="test umap")
    p.circle(
        x="x",
        y="y",
        source=source,
        color={"field": "label", "transform": cmap},
        legend_field="label",
    )
    print("Show transformed")
    show(p)

plot_fit()
plot_transform(model)

cjnolet commented 3 years ago

We synced offline about this, and we are both able to produce the correct result and match the reference implementation by setting n_epochs=200. However, setting n_epochs=0 (the default) should give a better result than it does, whether that means we need to increase the default number of epochs or something else is going on.

Here's an image of the result when n_epochs is explicitly set to 200 (for random_state=1994): [image]

And here's an image when it's set to the default: [image]

trivialfis commented 3 years ago

Actually, for both fit and transform, when the number of epochs is small, the CPU result is better than the GPU result.

cjnolet commented 3 years ago

Yep, I do agree that I have seen slower convergence of the optimizer in the past with both training and inference. For example, setting n_epochs=25 does tend to converge faster on CPU, but that value will often be set much higher internally by default in both implementations.

At one point I dug deeply into the cause of the slower convergence and traced it to consistency issues caused by data races during the optimization step. However, I would have expected that to go away when random_state is set to a nonzero value, so this may well have a different cause.

trivialfis commented 3 years ago

I think initialization is part of the problem. The following plots show the result after a single iteration of fitting on the shuttle dataset; you can see that the CPU implementation already has well-separated clusters. CPU: [image] GPU: [image]

cjnolet commented 3 years ago

We're just performing a spectral embedding on the entire dataset, rather than the "multi-component layout" approach that is used in UMAP (and was considered experimental at the time). I would still have expected the resulting embedding to have more separation than this, though, and I'm wondering if the Lanczos solver might not be converging. What dataset is this? Can you try computing the spectral embedding using scikit-learn's SpectralEmbedding and see if it improves the separation?

For example, here's a spectral embedding of the digits dataset from scikit-learn: [image: digits_spectral]
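A minimal sketch of how such a digits embedding could be computed with scikit-learn is shown here. The settings (`affinity="nearest_neighbors"`, `n_neighbors=15`, `random_state=1994`) are assumptions chosen for illustration; the thread does not say which parameters produced the image above.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import SpectralEmbedding

# The digits dataset ships with scikit-learn, so no download is needed.
digits = load_digits()
X_digits, y_digits = digits.data, digits.target

# Assumed settings: a k-nearest-neighbour affinity graph with 15 neighbours.
spectral = SpectralEmbedding(
    n_components=2,
    affinity="nearest_neighbors",
    n_neighbors=15,
    random_state=1994,
)
digits_embedding = spectral.fit_transform(X_digits)

print(digits_embedding.shape)
```

Plotting `digits_embedding` coloured by `y_digits` (for instance with the bokeh code from the original report) gives a reference picture of how much separation a plain spectral initialization can provide.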

I suppose another possibility is that the connectivities graph could be incorrect; however, if that were the case, I'd think the UMAP solver would likely not converge at all.

trivialfis commented 3 years ago

It's the shuttle dataset from UCI. Thanks for the suggestions, I will try to pinpoint the issue.

trivialfis commented 3 years ago

This is the spectral embedding output from sklearn on the shuttle dataset: [image: spectral]

OuNao commented 3 years ago

Hi,

I'm having problems with cuML's UMAP transform too.

I'm an R developer trying to use reticulate/cuML to speed up my analysis.

Sample data: https://drive.google.com/file/d/1oJjmNAS_KAw4tH_DCNG4IN2UHg7tyEV1/view?usp=sharing

data2 is data with the last row moved to the first position. In R: data2 <- data[c(93061, 1:93060), ]

Results from cuML and umap-learn:

  1. res = cuml.UMAP(random_state=1994).fit(data) [image: cuMLumap fit]

  2. res2 = res.transform(data2) [image: cuMLumap transform]

  3. res3 = umap.UMAP(random_state=1994).fit(data) [image: umap-learn fit]

  4. res4 = res3.transform(data2) [image: umap-learn transform]
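Because data2 is only a row permutation of data, each transformed row can be matched to the fitted position of the same sample. The sketch below shows one way to turn that into a per-row displacement, assuming the fitted coordinates are available as a NumPy array (for example from the model's `embedding_` attribute). The helper names `last_row_first` and `transform_displacement` are hypothetical and introduced only for this illustration.

```python
import numpy as np


def last_row_first(n_rows):
    """Zero-based equivalent of the R index c(n, 1:(n - 1))."""
    return np.concatenate(([n_rows - 1], np.arange(n_rows - 1)))


def transform_displacement(fit_embedding, transform_embedding, perm):
    """Distance between each transformed row and the fitted position of the same sample.

    Row i of the permuted data is sample perm[i] of the original data,
    so it is compared with fit_embedding[perm[i]].
    """
    fit_embedding = np.asarray(fit_embedding, dtype=np.float64)
    transform_embedding = np.asarray(transform_embedding, dtype=np.float64)
    return np.linalg.norm(transform_embedding - fit_embedding[perm], axis=1)


# Toy check with 5 samples: a perfect transform has zero displacement,
# and a transform shifted by (3, 4) has displacement 5 for every row.
perm = last_row_first(5)
fit_emb = np.arange(10, dtype=np.float64).reshape(5, 2)
perfect = transform_displacement(fit_emb, fit_emb[perm], perm)
shifted = transform_displacement(fit_emb, fit_emb[perm] + np.array([3.0, 4.0]), perm)
print(perm, perfect, shifted)
```

Summarizing the displacement (mean or median) for cuML and for umap-learn would put a number on the difference visible in the four plots.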

I tried init="spectral" and init="random", with no improvement.

I have datasets with many millions of rows, and I think fitting on a portion of the rows (because of the GPU memory limit) and transforming the rest is the better choice, but with this lower accuracy I can't use the GPU yet.
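The fit-on-a-portion, transform-the-rest workflow could be sketched as follows for any estimator that exposes `fit` and `transform`. Everything here is an assumption made for illustration: the helper name `fit_sample_transform_rest`, the random sampling of the fit rows, and the batch size. `FirstTwoColumns` is a hypothetical stand-in so the sketch runs without a GPU; in practice the model would be a UMAP instance.

```python
import numpy as np


def fit_sample_transform_rest(model, data, n_fit, batch_size, seed=0):
    """Fit on a random sample of n_fit rows, then transform the other rows in batches.

    Returns the indices of the transformed rows and the list of batch outputs,
    in matching order.
    """
    data = np.asarray(data)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    fit_idx, rest_idx = order[:n_fit], order[n_fit:]

    model.fit(data[fit_idx])

    batches = []
    for start in range(0, len(rest_idx), batch_size):
        idx = rest_idx[start:start + batch_size]
        batches.append(model.transform(data[idx]))
    return rest_idx, batches


class FirstTwoColumns:
    """Hypothetical stand-in model: 'embeds' rows by keeping their first two columns."""

    def fit(self, X):
        return self

    def transform(self, X):
        return X[:, :2]


toy = np.arange(40).reshape(10, 4)
rest_idx, batches = fit_sample_transform_rest(
    FirstTwoColumns(), toy, n_fit=4, batch_size=4
)
print([len(b) for b in batches])
```

Keeping `rest_idx` alongside the batches makes it possible to put the transformed rows back into their original order afterwards.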

Can this be fixed? Or is this type of error inherent to GPU parallelism?

Thanks.

ENV: Debian 10, RTX 2060 Super, NVIDIA driver 460, CUDA 11.2

EDIT1: Fixed/improved with n_epochs=500. With the default n_epochs=None, transform uses n_epochs=30. With n_epochs=500, transform uses n_epochs=166. umap-learn with the default n_epochs (200 for fit, 30 for transform) is OK, but 30 epochs is not enough for cuml.UMAP.transform. For my needs, using n_epochs=500 is fine! Thanks.

fit: [image: cuMLumap_n_epochs500] transform: [image: cuMLumap_n_epochs500_transform]
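The two data points reported in the edit above (30 transform epochs with the default setting, 166 with n_epochs=500) are consistent with a rule of roughly one third of n_epochs at transform time. The sketch below encodes that reading. It is inferred only from those reported numbers, not taken from the source of either library, and `transform_epochs` is a hypothetical name.

```python
def transform_epochs(n_epochs=None, default_transform_epochs=30):
    """Number of optimization epochs used by transform, under the assumed rule.

    Assumption: transform runs n_epochs // 3 epochs when n_epochs is given,
    and 30 epochs when n_epochs is left at its default (None).
    """
    if n_epochs is None:
        return default_transform_epochs
    return n_epochs // 3


for setting in (None, 200, 500):
    print(setting, transform_epochs(setting))
```

Under this reading, the n_epochs=200 setting mentioned earlier in the thread would give about 66 transform epochs, which may be why raising n_epochs improves transform results as well as fit results.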

trivialfis commented 3 years ago

So far I have narrowed it down to the initialization step. I will update once I have something new.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.