Open chychen opened 5 years ago
Hello, thanks for the advice! It would be nice to have the explained variance as an attribute, but if you look at the PCA demo under the demos directory you can see how to calculate it with the current implementation for now. Now about the speed, I am not really sure how it is that much faster. However, if that is the case, and the accuracy is comparable, then that's a win for skcuda! I will look into comparing to RAPIDS.
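Roughly, following the demo, the ratio can be derived from the transformed output like this (a sketch; T_gpu is the array returned by fit_transform, and note this uses the standard deviation of each component, so it will not match sklearn's variance-based explained_variance_ratio_ exactly):

import numpy as np
T = T_gpu.get()                                # transformed data, one principal component per column
std_vec = np.std(T, axis=0)                    # spread of the data along each component
explained_ratio = std_vec / np.sum(std_vec)    # fraction of the total "explained" by each component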
@nmerrill67 Thank you so much, I have a clearer idea now. One more question: when comparing the accuracy of two different PCA implementations, exactly what metric would you use?
I would look at the orthogonality of the eigenvectors produced. You can see the demo for an example of that with skcuda PCA, but I'm not too sure about other implementations. You can also probably look at the variance explained between different implementations with the same dataset.
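For example, after fit_transform you could check something like this (a sketch; T_gpu is the transformed output, and normalizing by the norms makes the number comparable across datasets):

import numpy as np
T = T_gpu.get()
cos01 = np.dot(T[:, 0], T[:, 1]) / (np.linalg.norm(T[:, 0]) * np.linalg.norm(T[:, 1]))
print(cos01)   # should be very close to zero if the first two components are orthogonal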
@nmerrill67 I ran into a few problems while applying cuPCA(n_components=1000) to a weather dataset (shape=[14608, 29161]). Why did I get the following results? Is there anything I should adjust?
Sorry for the delay. Can you show a full minimal working example? I would like to see how you construct the array and perform the dot products. The code works fine in the demo.
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import numpy as np
import skcuda.linalg as linalg
from skcuda.linalg import PCA as cuPCA
import skcuda.misc as cumisc
from sklearn.decomposition import PCA as skPCA
pca = cuPCA() # take all principal components
demo_types = [np.float32, np.float64] # we can use single or double precision
precisions = ['single', 'double']
print("Principal Component Analysis Demo!")
print("Compute all 100 principal components of a 1000x100 data matrix")
print("Lets test if the first two resulting eigenvectors (principal components) are orthogonal, by dotting them and seeing if it is about zero, then we can see the amount of the origial variance explained by just two of the original 100 dimensions.\n\n\n")
for i in range(len(demo_types)):
    demo_type = demo_types[i]
    X = np.random.rand(1000, 100).astype(demo_type)  # 1000 samples of 100-dimensional data vectors
    X_gpu = gpuarray.GPUArray((1000, 100), demo_type, order="F")  # order="F" or a transpose is necessary: fit_transform expects column-major (Fortran-order) matrices, while numpy defaults to row-major
    X_gpu.set(X)  # copy data to gpu
    T_gpu = pca.fit_transform(X_gpu)  # calculate the principal components
    dot_product = linalg.dot(T_gpu[:, 0], T_gpu[:, 1])  # show that the resulting eigenvectors are orthogonal
    print("The dot product of the two " + str(precisions[i]) + " precision eigenvectors is: " + str(dot_product))
    # now get the variance of each eigenvector so we can see the percent explained by the first two
    std_vec = np.std(T_gpu.get(), axis=0)
    print("We explained " + str(100 * np.sum(std_vec[:2]) / np.sum(std_vec)) + "% of the variance with 2 principal components in " + str(precisions[i]) + " precision")
    explained_ratio = std_vec / np.sum(std_vec)
    print(100 * explained_ratio[:20])
# The dot product of the two single precision eigenvectors is: -0.0029296875
# We explained 38.464847894087015% of the variance with 2 principal components in single precision
# [19.150698 19.314152 0.72585124 0.7263616 0.7292958 0.72377294
# 0.7307805 0.70606905 0.7240352 0.72484285 0.6790633 0.6997729
# 0.72107005 0.7098276 0.68848914 0.6984288 0.69108367 0.67857355
# 0.6864188 0.6769071 ]
# The dot product of the two double precision eigenvectors is: -3.637978807091713e-12
# We explained 38.69219545617245% of the variance with 2 principal components in double precision
# [19.01539724 19.67679822 0.71831804 0.67335341 0.71340542 0.70390551
# 0.71937638 0.68879677 0.7214479 0.72892824 0.70459986 0.69231374
# 0.68693753 0.68605994 0.67808385 0.69551102 0.70051158 0.67833128
# 0.67062553 0.68177699]
print('SKLearn Version')
pca_sk = skPCA(n_components=20,svd_solver='randomized',
whiten=True, random_state=12321)
result_sk = pca_sk.fit_transform(X)
print(pca_sk.explained_variance_ratio_)
# SKLearn Version
# [0.01634082 0.01618751 0.01590793 0.01568147 0.01535512 0.01520492
# 0.01501788 0.01472281 0.01458227 0.01424056 0.01396947 0.01394167
# 0.01367067 0.01350416 0.01331921 0.01328396 0.01311839 0.01265229
# 0.0125442 0.01236293]
I modified the demo code in this repository and compared it with sklearn's PCA.
It is strange that the explained variance is not sorted. I am not really sure why that happens, but it may have something to do with the Gram-Schmidt algorithm used. Compared to sklearn, though, the ratio for the skcuda version is clearly better when using only a few principal components. You can read more about the algorithm here. I based the skcuda version on the C code provided in the paper.
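For reference, this is a rough numpy sketch of classical Gram-Schmidt orthonormalization, just to show the idea of removing the directions already found; it is not the GPU implementation from the paper:

import numpy as np

def gram_schmidt(V):
    # orthonormalize the columns of V
    Q = np.zeros_like(V, dtype=np.float64)
    for k in range(V.shape[1]):
        v = V[:, k].astype(np.float64)
        for j in range(k):
            v -= np.dot(Q[:, j], V[:, k]) * Q[:, j]   # subtract the projection onto each earlier direction
        Q[:, k] = v / np.linalg.norm(v)               # normalize what remains
    return Q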
As for your own data, I am not sure why the dot product is so large. I would make sure that you are using order='F' when constructing the GPUArray and that you are dotting the right columns. As you can see from the demo, it should be very close to zero, if not exactly 0.0.
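Something like this is what I would expect for your array (a minimal sketch; weather_data is just a placeholder for your [14608, 29161] matrix, and double precision keeps the dot product small):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from skcuda.linalg import PCA as cuPCA

X = weather_data.astype(np.float64)                       # placeholder for your data, rows = samples
X_gpu = gpuarray.GPUArray(X.shape, X.dtype, order="F")    # column-major layout, as in the demo
X_gpu.set(X)
T_gpu = cuPCA(n_components=1000).fit_transform(X_gpu)

T = T_gpu.get()
print(np.dot(T[:, 0], T[:, 1]))                           # should be very close to 0.0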
The explained variance in the iris demo is not sorted either.
Problem
I am not familiar with cuBLAS programming, and the algorithm implemented in scikit-cuda's PCA seems different from sklearn's, so I tried to compare their explained_variance_ratio_ to see which one is better. However, I can't find an attribute such as explained_variance_ratio_ in scikit-cuda. How can I get it? Would it be possible to add sklearn-like attributes? That would be much more intuitive. I also applied PCA to a dummy array with shape [10000, 5000], and scikit-cuda's PCA was 150+ times faster than RAPIDS cuML's PCA. What makes them so different?
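For reference, this is roughly how such a comparison can be set up (a sketch only; n_components is illustrative, cuml.PCA is assumed to expose an sklearn-style fit_transform, and in practice the two libraries may be easier to time in separate processes):

import time
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from skcuda.linalg import PCA as cuPCA
from cuml import PCA as cumlPCA

X = np.random.rand(10000, 5000).astype(np.float32)        # dummy array with the shape mentioned above

# scikit-cuda
X_gpu = gpuarray.GPUArray(X.shape, X.dtype, order="F")
X_gpu.set(X)
t0 = time.time()
T_cu = cuPCA(n_components=100).fit_transform(X_gpu)       # n_components chosen only for illustration
print("skcuda PCA: %.3f s" % (time.time() - t0))

# RAPIDS cuML
t0 = time.time()
T_ml = cumlPCA(n_components=100).fit_transform(X)
print("cuML PCA: %.3f s" % (time.time() - t0))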
Environment