kevinhughes27 / pyIPCA

Python package for Incremental Principal Component Analysis

Problematic results #2

Open yue-wu opened 10 years ago

yue-wu commented 10 years ago

I made a toy example to test your code, and I suspect the results are somewhat incorrect. The following is the code that I used.

under ipython

from sklearn.decomposition import PCA
from pyIPCA import CCIPCA, Skocaj_IPCA, Hall_IPCA
from matplotlib import pyplot  # needed for the scatter plots below
import numpy as np

make toy data

data = np.random.rand(10000, 10) * 100

use sklearn pca

ncomp = 2
pca = PCA(n_components=2)
pca.fit(data)
data_pca = pca.transform(data)
pyplot.scatter(data_pca[:, 0], data_pca[:, 1])
pyplot.title('Sklearn-PCA')
pyplot.show()

[image: sklearn_pca]

use CCIPCA

ipca = CCIPCA(n_components=2)
ipca.fit(data)
idata_pca = ipca.transform(data)
pyplot.scatter(idata_pca[:, 0], idata_pca[:, 1])
pyplot.title('CCIPCA')
pyplot.show()

[image: ccipca]

use Skocaj_IPCA

ipca = Skocaj_IPCA(n_components=2)
ipca.fit(data)
idata_pca = ipca.transform(data)
pyplot.scatter(idata_pca[:, 0], idata_pca[:, 1])
pyplot.title('Skocaj_IPCA')
pyplot.show()

[image: skocaj_ipca]

use Hall_IPCA

ipca = Hall_IPCA(n_components=2)
ipca.fit(data)
idata_pca = ipca.transform(data)
pyplot.scatter(idata_pca[:, 0], idata_pca[:, 1])
pyplot.title('Hall_IPCA')
pyplot.show()

[image: hall_ipca]

It seems that neither CCIPCA nor Skocaj_IPCA works properly: the center of their transformed data is far from the origin (0,0), and the point cloud looks more like an oval than a circle.
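The two symptoms above can be turned into numbers instead of being read off a plot. The following is a minimal sketch using only NumPy; `pca_fit` and `projection_diagnostics` are hypothetical helpers written for this illustration and are not part of pyIPCA or sklearn. It assumes that a correct PCA projection subtracts the data mean before projecting. If that step is skipped, the center of the projected cloud moves to the projection of the data mean, which is one possible (unconfirmed) explanation for an off-origin result.

```python
import numpy as np


def pca_fit(data, n_components):
    """Reference batch PCA via SVD of the mean-centred data (illustrative helper)."""
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]


def projection_diagnostics(projected):
    """Return the centre of the projected cloud and its max/min spread ratio."""
    centre = projected.mean(axis=0)
    spread = projected.std(axis=0)
    return centre, spread.max() / spread.min()


rng = np.random.default_rng(0)
data = rng.random((10000, 10)) * 100  # same shape and scale as the toy data

mean, components = pca_fit(data, 2)

# Correct projection: subtract the mean first.
centred = (data - mean) @ components.T
# Faulty projection (assumed failure mode): mean subtraction skipped.
uncentred = data @ components.T

centre_ok, ratio_ok = projection_diagnostics(centred)
centre_bad, _ = projection_diagnostics(uncentred)

print("centre with mean subtraction:   ", centre_ok)
print("centre without mean subtraction:", centre_bad)
print("spread ratio (1.0 = circular):  ", ratio_ok)
```

For i.i.d. uniform data all directions have nearly the same variance, so the spread ratio of a correct projection should stay close to 1. Running the same diagnostics on the output of each incremental method would show which of the two symptoms it exhibits.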

By the way, Skocaj_IPCA often raises the following warning on my machine: RuntimeWarning: invalid value encountered in divide explainedvariance.sum())
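NumPy emits "invalid value encountered in divide" when a division produces NaN, for example 0/0. One plausible cause here, which is an assumption and has not been checked against the pyIPCA source, is that the explained variances sum to zero or are already NaN when the ratio is computed. The sketch below shows a guarded version of such a ratio; `explained_variance_ratio` is a hypothetical helper, not a pyIPCA function.

```python
import numpy as np


def explained_variance_ratio(explained_variance):
    """Normalise variances to fractions, avoiding 0/0 (illustrative helper)."""
    ev = np.asarray(explained_variance, dtype=float)
    total = ev.sum()
    # A zero or non-finite total would make the division produce NaN,
    # so return zeros instead of dividing in that case.
    if not np.isfinite(total) or total <= 0.0:
        return np.zeros_like(ev)
    return ev / total


print(explained_variance_ratio([3.0, 1.0]))
print(explained_variance_ratio([0.0, 0.0]))
```

Returning zeros is only one possible policy; raising an error would make the degenerate state more visible.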

Many thanks for your contributions to sklearn.

Rex

kevinhughes27 commented 10 years ago

Hmm, you are right, something does look off there. It's been a while since I worked on this, but I remember something about the last component being off with some of the methods. With real data the last dimension is usually uninformative and simply orthogonal to the others, so I think these incremental methods might not bother to get it right. It may be worth re-reading the papers to see whether they mention it.

yue-wu commented 10 years ago

Thank you for your efforts. At least Hall_IPCA works fine, so I will use it to compute a PCA over 3M samples. If I find anything wrong, I will let you know. By the way, I later noticed that pylearn2 (still under development) also has its own online PCA, but I guess it uses yet another method. In case you are interested, here is the link to the project: http://deeplearning.net/software/pylearn2/index.html

kevinhughes27 commented 10 years ago

Cool, thanks for the link!

If you find any problems feel free to send a patch my way!

yue-wu commented 10 years ago

Thank you again. It seems that incremental PCA only solves the problem of running out of memory; asking a cluster to use a single core to compute the PCA is not a good idea. I am now reading http://mdp-toolkit.sourceforge.net/tutorial/parallel.html, which appears to provide a way to perform parallel PCA estimation.
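To illustrate why PCA estimation can be spread over several workers, here is a minimal NumPy sketch. It is not the MDP toolkit's API; `chunk_stats`, `merge_stats`, and `pca_from_chunks` are hypothetical names. Each chunk is reduced independently to three sufficient statistics (row count, column sums, and the sum of outer products), and the merged statistics give the same covariance matrix as a single pass over all the data. A thread pool is used here only to show the map step; on a cluster each chunk would go to a separate process or node.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def chunk_stats(chunk):
    """Per-chunk sufficient statistics: row count, column sums, X^T X."""
    return chunk.shape[0], chunk.sum(axis=0), chunk.T @ chunk


def merge_stats(stats):
    """Combine per-chunk statistics into the global mean and covariance."""
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    scatter = sum(s[2] for s in stats)
    mean = total / n
    # Unbiased covariance from uncentred sums; fine for a sketch, although
    # it can lose precision when the mean is large relative to the spread.
    cov = (scatter - n * np.outer(mean, mean)) / (n - 1)
    return mean, cov


def pca_from_chunks(chunks, n_components, max_workers=4):
    """Map chunk_stats over the chunks, merge, then eigendecompose."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        stats = list(pool.map(chunk_stats, chunks))
    mean, cov = merge_stats(stats)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order].T, eigvals[order]


rng = np.random.default_rng(0)
data = rng.random((10000, 10)) * 100
chunks = np.array_split(data, 8)  # stand-in for data held on different workers

mean, components, variances = pca_from_chunks(chunks, 2)
print(components.shape, variances)
```

Only the small per-chunk statistics are exchanged between workers, so this approach also keeps memory use bounded, much as the incremental methods do.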