lebedov / scikit-cuda

Python interface to GPU-powered libraries
http://scikit-cuda.readthedocs.org/

question about the PCA attributes #270

Open chychen opened 5 years ago

chychen commented 5 years ago

Problem

I am not familiar with cuBLAS programming, and the algorithm implemented in scikit-cuda's PCA seems different from sklearn's, so I tried to compare their explained_variance_ratio_ values to see which one is better. However, I can't find attributes such as explained_variance_ratio_ in scikit-cuda. How can I get that attribute? Would it be possible to add sklearn-like attributes? That would be much more intuitive. When I apply PCA to a dummy array of shape [10000, 5000], scikit-cuda's PCA is 150+ times faster than RAPIDS cuML's PCA. What makes them so different?
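For reference, the rough shape of the timing comparison on the dummy array (a minimal sketch, not the exact benchmark I ran; it only covers the skcuda and sklearn sides, and assumes the cuPCA/skPCA usage shown later in this thread):

import time
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from skcuda.linalg import PCA as cuPCA
from sklearn.decomposition import PCA as skPCA

X = np.random.rand(10000, 5000).astype(np.float32)   # dummy data of the shape mentioned above

# skcuda: copy to the GPU (column-major, as fit_transform expects) and time fit_transform
X_gpu = gpuarray.GPUArray(X.shape, X.dtype, order="F")
X_gpu.set(X)
pca_gpu = cuPCA(n_components=20)   # 20 components chosen just for illustration
start = time.time()
T_gpu = pca_gpu.fit_transform(X_gpu)
print("skcuda PCA: %.3f s" % (time.time() - start))

# sklearn on the CPU for comparison
pca_cpu = skPCA(n_components=20)
start = time.time()
T_cpu = pca_cpu.fit_transform(X)
print("sklearn PCA: %.3f s" % (time.time() - start))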

Environment

nmerrill67 commented 5 years ago

Hello, thanks for the suggestion! It would be nice to have the explained variance as an attribute, but if you look at the PCA demo under the demos directory, you can see how to calculate it with the current implementation for now. As for the speed, I am not really sure why it is that much faster. However, if that is the case and the accuracy is comparable, then that's a win for skcuda! I will look into comparing against RAPIDS.
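The relevant part of the demo boils down to something like this (a minimal sketch; it uses the standard deviation of each column of the transformed output, so it is not exactly the same quantity as sklearn's explained_variance_ratio_):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from skcuda.linalg import PCA as cuPCA

X = np.random.rand(1000, 100).astype(np.float64)          # dummy data

X_gpu = gpuarray.GPUArray(X.shape, X.dtype, order="F")    # fit_transform wants column-major storage
X_gpu.set(X)

pca = cuPCA()                      # keep all principal components
T_gpu = pca.fit_transform(X_gpu)   # projected data, one column per principal component

std_vec = np.std(T_gpu.get(), axis=0)          # spread of each projected component
explained_ratio = std_vec / np.sum(std_vec)    # fraction of the total spread per component
print(100 * explained_ratio[:2])               # percent accounted for by the first two components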

chychen commented 5 years ago

@nmerrill67 thank you so much, I have a clearer idea now. One more question: when comparing the accuracy of two different PCA implementations, what exact metric would you use?

nmerrill67 commented 5 years ago

I would look at the orthogonality of the eigenvectors produced. You can see the demo for an example of that with skcuda PCA, but I'm not too sure about other implementations. You could also compare the variance explained by different implementations on the same dataset.
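A minimal sketch of that kind of orthogonality check (the sklearn part is my assumption about how one would do it there; the skcuda part follows the demo, which dots columns of the transformed output):

import numpy as np
from sklearn.decomposition import PCA as skPCA

X = np.random.rand(1000, 100)

# sklearn: rows of components_ are the principal axes, so any two should dot to roughly zero
pca_sk = skPCA(n_components=20).fit(X)
print(np.dot(pca_sk.components_[0], pca_sk.components_[1]))

# skcuda: the demo instead dots the first two columns of fit_transform's output,
# e.g. linalg.dot(T_gpu[:, 0], T_gpu[:, 1]), which should likewise be near zero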

chychen commented 5 years ago

@nmerrill67 I ran into a few problems while applying cuPCA(n_components=1000) to a weather dataset (shape = [14608, 29161]). Why do I get the following results, and is there anything I should adjust?

  1. I print out the explained_variance_ratio_ (%) of all 20 eigenvectors, but it seems unordered, which confuses me. Shouldn't the algorithm sort the vectors by their eigenvalues?
  2. The dot product of the first two eigenvectors is huge. Why?
  3. Why do the results for the first two eigenvectors differ so much between single and double precision?
  4. Should I apply normalization to the data before PCA?

  Without normalization on the data before PCA:
    • The dot product of the first two single precision eigenvectors is: 68719480000.0
    • explained_variance_ratio_ (%): [1.4833782e+00 6.2119293e+01 2.3850111e-02 2.2604733e-03 7.9570673e-03 5.7556047e-03 3.3993192e-03 4.7925347e-03 3.6503188e-03 4.8022955e-03 4.1354685e-03 4.2750631e-03 4.7590258e-03 4.5807809e-03 3.9173421e-03 2.4755828e-03 2.4638816e-03 4.1929199e-03 3.0926433e-03 2.6495401e-03]
    • The dot product of the first two double precision eigenvectors is: 768.0
    • explained_variance_ratio_ (%): [2.31805225e+00 9.70760008e+01 3.73018648e-02 3.17185410e-02 1.97747878e-02 1.39515186e-02 1.14985860e-02 1.23412252e-02 1.08304167e-02 7.91128147e-03 1.05471510e-02 8.20337914e-03 7.58816510e-03 7.51196821e-03 4.84178303e-03 5.31908552e-03 4.50089480e-03 5.17230226e-03 4.46902245e-03 3.88342690e-03]

  With normalization on the data before PCA:
    • The dot product of the first two single precision eigenvectors is: 1.78125
    • explained_variance_ratio_ (%): [1.4711514 4.986023 2.6114032 1.8182194 2.5145812 1.2482839 2.2641108 1.858542 1.3103851 1.435258 1.3762304 1.0806296 1.1823533 0.72226316 1.4757944 0.88305855 0.8263093 0.94232464 0.7360602 0.60788965]
    • The dot product of the first two double precision eigenvectors is: 9.313225746154785e-10
    • explained_variance_ratio_ (%): [1.4712189 4.98624987 2.6115224 1.81830222 2.51469619 1.24834035 2.26421319 1.85863199 1.31044151 1.43532403 1.37628676 1.08068028 1.1824082 0.72226359 1.4759095 0.88310218 0.82634353 0.94235993 0.73608959 0.60785987]
nmerrill67 commented 5 years ago

Sorry for the delay. Can you show a full minimal working example? I would like to see how you construct the array and perform the dot products. The code works fine in the demo.

chychen commented 5 years ago

import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import numpy as np
import skcuda.linalg as linalg
from skcuda.linalg import PCA as cuPCA
import skcuda.misc as cumisc
from sklearn.decomposition import PCA as skPCA

pca = cuPCA() # take all principal components

demo_types = [np.float32, np.float64] # we can use single or double precision
precisions = ['single', 'double']

print("Principal Component Analysis Demo!")
print("Compute all 100 principal components of a 1000x100 data matrix")
print("Lets test if the first two resulting eigenvectors (principal components) are orthogonal, by dotting them and seeing if it is about zero, then we can see the amount of the origial variance explained by just two of the original 100 dimensions.\n\n\n")

for i in range(len(demo_types)):

    demo_type = demo_types[i]
    X = np.random.rand(1000,100).astype(demo_type) # 1000 samples of 100-dimensional data vectors
    X_gpu = gpuarray.GPUArray((1000,100), demo_type, order="F") # note that order="F" or a transpose is necessary. fit_transform requires row-major matrices, and column-major is the default
    X_gpu.set(X) # copy data to gpu
    T_gpu = pca.fit_transform(X_gpu) # calculate the principal components
    dot_product = linalg.dot(T_gpu[:,0], T_gpu[:,1]) # show that the resulting eigenvectors are orthogonal 
    print("The dot product of the two " + str(precisions[i]) + " precision eigenvectors is: " + str(dot_product))
    # now get the variance of each eigenvector so we can see the percent explained by the first two
    std_vec = np.std(T_gpu.get(), axis=0)
    print("We explained " + str(100 * np.sum(std_vec[:2]) / np.sum(std_vec)) + "% of the variance with 2 principal components in " +  str(precisions[i]) +  " precision")
    explained_ratio = std_vec / np.sum(std_vec)
    print(100 * explained_ratio[:20])
    # The dot product of the two single precision eigenvectors is: -0.0029296875
    # We explained 38.464847894087015% of the variance with 2 principal components in single precision
    # [19.150698   19.314152    0.72585124  0.7263616   0.7292958   0.72377294
    # 0.7307805   0.70606905  0.7240352   0.72484285  0.6790633   0.6997729
    # 0.72107005  0.7098276   0.68848914  0.6984288   0.69108367  0.67857355
    # 0.6864188   0.6769071 ]

    # The dot product of the two double precision eigenvectors is: -3.637978807091713e-12
    # We explained 38.69219545617245% of the variance with 2 principal components in double precision
    # [19.01539724 19.67679822  0.71831804  0.67335341  0.71340542  0.70390551
    # 0.71937638  0.68879677  0.7214479   0.72892824  0.70459986  0.69231374
    # 0.68693753  0.68605994  0.67808385  0.69551102  0.70051158  0.67833128
    # 0.67062553  0.68177699]

print('SKLearn Version')
pca_sk = skPCA(n_components=20,svd_solver='randomized', 
            whiten=True, random_state=12321)
result_sk = pca_sk.fit_transform(X)
print(pca_sk.explained_variance_ratio_)
# SKLearn Version
# [0.01634082 0.01618751 0.01590793 0.01568147 0.01535512 0.01520492
#  0.01501788 0.01472281 0.01458227 0.01424056 0.01396947 0.01394167
#  0.01367067 0.01350416 0.01331921 0.01328396 0.01311839 0.01265229
#  0.0125442  0.01236293]

I modified the demo code in this repository and compared it with sklearn's PCA.
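One thing worth noting when reading these numbers side by side: sklearn's explained_variance_ratio_ is a ratio of variances, while the snippet above uses the standard deviation of each projected column, so the two sets of percentages are not directly comparable. A variance-based ratio can be computed from the same output (a sketch on my part, not part of the original demo):

T = T_gpu.get()                       # projected data from cuPCA.fit_transform above
var_vec = np.var(T, axis=0)           # per-component variance instead of standard deviation
explained_ratio = var_vec / np.sum(var_vec)
print(100 * explained_ratio[:20])     # percentages closer in meaning to sklearn's explained_variance_ratio_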

nmerrill67 commented 5 years ago

It is strange that the explained variance is not sorted. I am not really sure why that would happen, but it may have something to do with the Gram-Schmidt algorithm used. The ratio for the skcuda version does look better than sklearn's when using only a few principal components, though. You can read more about the algorithm here; I based the skcuda version on the C code provided in the paper.

As for your own data, I am not sure how the dot product ends up so large. I would make sure that you are using order='F' when constructing the GPUArray and that you are computing the dot products correctly. As the demo shows, the result should be very close to zero, if not exactly 0.0.
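As a concrete starting point, the construction might look like the following (a minimal sketch; the standardization step is just one possible answer to question 4 above, and a smaller stand-in array is used rather than the actual weather data):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import skcuda.linalg as linalg
from skcuda.linalg import PCA as cuPCA

X = np.random.rand(14608, 2916).astype(np.float64)   # stand-in for the weather data

# optional standardization: zero mean and unit variance per feature before PCA
X = (X - X.mean(axis=0)) / X.std(axis=0)

# order="F" matters here: fit_transform expects column-major storage
X_gpu = gpuarray.GPUArray(X.shape, X.dtype, order="F")
X_gpu.set(X)

pca = cuPCA(n_components=20)
T_gpu = pca.fit_transform(X_gpu)

# orthogonality check on the first two projected columns, as in the demo
print(linalg.dot(T_gpu[:, 0], T_gpu[:, 1]))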

bckim1318 commented 2 years ago

The explained variance in the iris demo is not sorted either.
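If the ordering matters for downstream use, one workaround (my own suggestion, not a built-in option of skcuda's PCA) is to reorder the components afterwards by the variance of each projected column:

import numpy as np

T = T_gpu.get()                          # projected data returned by cuPCA.fit_transform
var_vec = np.var(T, axis=0)              # variance captured by each component
order = np.argsort(var_vec)[::-1]        # component indices from largest to smallest variance
T_sorted = T[:, order]                   # columns reordered into descending-variance order
print(var_vec[order] / var_vec.sum())    # descending explained-variance ratios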