rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] different outputs for PCA on CPU vs. GPU #5473

Open · stephanie-fu opened this issue 1 year ago

stephanie-fu commented 1 year ago

Describe the bug

Running cuml.PCA with set_global_device_type set to 'CPU' and to 'GPU' produces different results (the 'CPU' run matches the output of sklearn's PCA).

Steps/Code to reproduce bug

from cuml.common.device_selection import set_global_device_type
import matplotlib.pyplot as plt
import torch

from sklearn.decomposition import PCA as skPCA
from cuml import PCA as cuPCA

data = torch.randn((10000, 2)).cuda()
data[:, 0] *= 10

plt.scatter(data[:, 0].cpu(), data[:, 1].cpu())
plt.gca().set_xlim(-50, 50)
plt.gca().set_ylim(-50, 50)
plt.show()

# 1: cuML on CPU
set_global_device_type('CPU')
dimred = cuPCA(n_components=2, output_type='numpy')
data_dimred = dimred.fit_transform(data)
plt.scatter(data_dimred[:, 0], data_dimred[:, 1])
plt.gca().set_xlim(-50, 50)
plt.gca().set_ylim(-50, 50)
plt.show()

# 2: cuML on GPU
set_global_device_type('GPU')
dimred = cuPCA(n_components=2, output_type='numpy')
data_dimred = dimred.fit_transform(data)
plt.scatter(data_dimred[:, 0], data_dimred[:, 1])
plt.gca().set_xlim(-50, 50)
plt.gca().set_ylim(-50, 50)
plt.show()

# 3: sklearn on CPU
dimred = skPCA(n_components=2)
data_dimred = dimred.fit_transform(data.cpu())
plt.scatter(data_dimred[:, 0], data_dimred[:, 1])
plt.gca().set_xlim(-50, 50)
plt.gca().set_ylim(-50, 50)
plt.show()

Expected behavior

All three examples above are expected to have the same output.


lowener commented 1 year ago

@stephanie-fu This is related to https://github.com/rapidsai/cuml/issues/4560. The signs are flipped, but the result is still valid.

antortjim commented 1 year ago

@lowener Could you please explain how the results are still valid? I would expect the output plot to be the same in all three cases. Thank you!

lowener commented 1 year ago

The issue that you opened on UMAP is not related to this one. The problem in this PCA issue is this function, https://github.com/rapidsai/cuml/blob/branch-23.08/cpp/src/tsvd/tsvd.cuh#L137, which is used in both PCA and TSVD. Sign flipping does not change the correctness of the results: any column of U and V can be negated without affecting the projection of the data, because PCA only measures how much each variable contributes to each component, positively or negatively. Here is a more in-depth answer: https://stats.stackexchange.com/a/88882
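To make the sign ambiguity concrete, here is a minimal NumPy sketch (illustrative only, not cuML code): negating a column of U together with the matching row of Vt yields an equally valid SVD, so projected coordinates can legitimately differ by a per-component sign between implementations.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))

# Any factorization A = U @ diag(S) @ Vt is only unique up to
# the sign of each singular-vector pair.
U, S, Vt = np.linalg.svd(A, full_matrices=False)

U2, Vt2 = U.copy(), Vt.copy()
U2[:, 0] *= -1   # flip the first left singular vector...
Vt2[0, :] *= -1  # ...and the matching right singular vector

print(np.allclose(U @ np.diag(S) @ Vt, A))    # True
print(np.allclose(U2 @ np.diag(S) @ Vt2, A))  # True: same reconstruction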

stephanie-fu commented 1 year ago

Thank you for the clarification. I am still a little confused about the output of the example above: I would expect all three code examples to produce an ellipse, but the GPU implementation gives something that visually looks different from a sign flip (in fact, the CPU output looks like a flipped version of the data, which seems valid). Is this the expected output?

[image: scatter plots of the three outputs]

antortjim commented 1 year ago


Thank you for the reply. Yes, I imagined the root causes were probably different, and thus the solutions or interpretations would differ as well. I linked this issue there because they are similar in the sense that both PCA and UMAP are heavily used algorithms that a lot of people will try to accelerate with cuML, and in doing so they will face the conundrum of hard-to-interpret results. Maybe in both cases more documentation would be beneficial.

I agree with @stephanie-fu that if the only explanation here is a sign flip, that still does not explain the loss of structure observed in the "PCA on GPU" plot. The GPU output is clearly not maximizing the variance along one axis. In other words, the expected output, regardless of any sign flip, is that most of the variance lies along one axis and only a little along the orthogonal axis, as opposed to the round blob observed. How can we explain this?

Thank you again!

lowener commented 1 year ago

I jumped too fast to the sign-flipping conclusion. The issue seems to be our compatibility with torch tensors. Converting the data to a CuPy array fixes the problem you are seeing and can be used as a workaround:

import cupy
import torch

data = torch.randn((10000, 2)).cuda()
data[:, 0] *= 10
# Converting the torch CUDA tensor to a CuPy array before
# passing it to cuML works around the bug.
data = cupy.array(data)
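For reference, a zero-copy alternative (my assumption, requiring a reasonably recent CuPy and PyTorch with DLPack support) is to hand the tensor over via DLPack instead of copying it:

import cupy
import torch

data = torch.randn((10000, 2)).cuda()
data[:, 0] *= 10
# Wrap the same GPU memory as a CuPy array without a copy
# (assumes CuPy >= 10 and PyTorch >= 1.10, which implement DLPack).
data_cp = cupy.from_dlpack(data)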
antortjim commented 1 year ago

I confirm that making the data a CuPy array solves the problem here! This solution is fully compatible with @stephanie-fu's code, except that the calls to .cpu() need to be replaced by .get().

antortjim commented 1 year ago

i.e., this code:

from cuml.common.device_selection import set_global_device_type
import matplotlib.pyplot as plt
import torch
import cupy
from sklearn.decomposition import PCA as skPCA
from cuml import PCA as cuPCA

data = torch.randn((10000, 2)).cuda()
data[:, 0] *= 10

data = cupy.array(data)

plt.scatter(data[:, 0].get(), data[:, 1].get())
plt.gca().set_xlim(-50, 50)
plt.gca().set_ylim(-50, 50)
plt.show()

# 1: cuML on CPU
set_global_device_type('CPU')
dimred = cuPCA(n_components=2, output_type='numpy')
data_dimred = dimred.fit_transform(data)
plt.scatter(data_dimred[:, 0], data_dimred[:, 1])
plt.gca().set_xlim(-50, 50)
plt.gca().set_ylim(-50, 50)
plt.show()

# 2: cuML on GPU
set_global_device_type('GPU')
dimred = cuPCA(n_components=2, output_type='numpy')
data_dimred = dimred.fit_transform(data)
plt.scatter(data_dimred[:, 0], data_dimred[:, 1])
plt.gca().set_xlim(-50, 50)
plt.gca().set_ylim(-50, 50)
plt.show()

# 3: sklearn on CPU
dimred = skPCA(n_components=2)
data_dimred = dimred.fit_transform(data.get())
plt.scatter(data_dimred[:, 0], data_dimred[:, 1])
plt.gca().set_xlim(-50, 50)
plt.gca().set_ylim(-50, 50)
plt.show()
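As a sanity check beyond eyeballing the plots, the three embeddings can also be compared numerically up to a per-component sign. This is a sketch; out_cpu, out_gpu, and out_sk are hypothetical names assumed to hold the data_dimred arrays from runs 1 to 3 above.

import numpy as np

def same_up_to_column_signs(a, b, atol=1e-3):
    # PCA components are only defined up to sign, so align each
    # column's sign (via the sign of the column dot product)
    # before comparing.
    signs = np.sign(np.sum(a * b, axis=0))
    signs[signs == 0] = 1.0
    return np.allclose(a * signs, b, atol=atol)

# out_cpu, out_gpu, out_sk: data_dimred from runs 1-3 (hypothetical names)
print(same_up_to_column_signs(out_cpu, out_sk))  # expected True
print(same_up_to_column_signs(out_gpu, out_sk))  # True once the CuPy workaround is applied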
stephanie-fu commented 1 year ago

Thank you for the workaround!