Open yue-wu opened 10 years ago
Hmm, you are right, something does look off there. It's been a while since I worked on this, but I remember something about the last component being off with some of the methods. When working with real data, the last dimension is usually useless and simply orthogonal to the others, so I think these incremental methods might not bother with getting it correct. Maybe re-read the papers and see if they mention it.
Thank you for your efforts. At least Hall_IPCA works fine. I will use it to compute the PCA for 3M samples. If I find anything wrong, I will let you know. By the way, I later noticed that pylearn2 (still under development) also has its own online PCA, but I guess it uses yet another method. In case you are interested, here is the link to the project: http://deeplearning.net/software/pylearn2/index.html.
Cool, thanks for the link!
If you find any problems feel free to send a patch my way!
Thank you again. It seems that incremental PCA only solves the problem of running out of memory; it is still not a good idea to ask a cluster to compute the PCA on a single core. I am now reading http://mdp-toolkit.sourceforge.net/tutorial/parallel.html. It seems that they provide a way to perform parallel PCA estimation.
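For reference, one generic way to spread a PCA over several workers is to have each worker compute per-chunk sufficient statistics (count, column sums, sum of outer products), merge them, and eigendecompose the resulting covariance once. The sketch below uses only NumPy and runs the "workers" sequentially; the function names are made up for illustration, and this is not necessarily how MDP implements its parallel PCA.

```python
from functools import reduce

import numpy as np


def chunk_stats(chunk):
    """Per-chunk sufficient statistics: row count, column sums, sum of outer products."""
    return chunk.shape[0], chunk.sum(axis=0), chunk.T @ chunk


def merge_stats(a, b):
    """Combine the statistics of two chunks (associative, so order does not matter)."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def pca_from_stats(stats, n_components):
    """Recover the mean, principal axes and variances from merged statistics."""
    n, col_sum, outer_sum = stats
    mean = col_sum / n
    # Unbiased sample covariance from the one-pass formula.
    cov = (outer_sum - n * np.outer(mean, mean)) / (n - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order].T, eigvals[order]


rng = np.random.default_rng(0)
data = rng.random((10000, 10)) * 100

# Each chunk could be handled by a separate process or node.
chunks = np.array_split(data, 8)
total = reduce(merge_stats, map(chunk_stats, chunks))
mean, components, variances = pca_from_stats(total, 2)

# Project the data onto the two leading axes.
projected = (data - mean) @ components.T
```

The one-pass covariance formula can lose precision when the column means are large compared with the spread of the data, so for such data a more careful merge of per-chunk centered statistics would be preferable.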
I made a toy example to test your code, and I suspect that some of the results are incorrect. This is the code I used:
# under ipython
from sklearn.decomposition import PCA
from pyIPCA import CCIPCA, Skocaj_IPCA, Hall_IPCA
from matplotlib import pyplot
import numpy as np

# make toy data
data = np.random.rand(10000, 10) * 100

# use sklearn pca
ncomp = 2
pca = PCA(n_components=ncomp)
pca.fit(data)
data_pca = pca.transform(data)
pyplot.scatter(data_pca[:, 0], data_pca[:, 1])
pyplot.title('Sklearn-PCA')
pyplot.show()

# use CCIPCA
ipca = CCIPCA(n_components=ncomp)
ipca.fit(data)
idata_pca = ipca.transform(data)
pyplot.scatter(idata_pca[:, 0], idata_pca[:, 1])
pyplot.title('CCIPCA')
pyplot.show()

# use Skocaj_IPCA
ipca = Skocaj_IPCA(n_components=ncomp)
ipca.fit(data)
idata_pca = ipca.transform(data)
pyplot.scatter(idata_pca[:, 0], idata_pca[:, 1])
pyplot.title('Skocaj_IPCA')
pyplot.show()

# use Hall_IPCA
ipca = Hall_IPCA(n_components=ncomp)
ipca.fit(data)
idata_pca = ipca.transform(data)
pyplot.scatter(idata_pca[:, 0], idata_pca[:, 1])
pyplot.title('Hall_IPCA')
pyplot.show()
It seems that neither CCIPCA nor Skocaj_IPCA works properly: the center of their transformed data is too far from the origin (0, 0), and the point cloud looks like an oval rather than a circle.
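To make that expectation concrete, here is a small NumPy-only sketch of what a batch PCA should produce on this kind of toy data. It does not use pyIPCA; `pca_project` is a hypothetical helper written for this check. Because PCA subtracts the mean before projecting, the transformed data should be centered at the origin, and because the ten columns are i.i.d. uniform, the first two components should have nearly the same spread (a roughly circular cloud).

```python
import numpy as np


def pca_project(data, n_components):
    """Batch PCA via SVD of the mean-centered data (reference for comparison)."""
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
data = rng.random((10000, 10)) * 100
projected = pca_project(data, 2)

center = projected.mean(axis=0)   # expected to be numerically zero
spread = projected.std(axis=0)    # expected to be nearly equal for both axes

print("center of projected data:", center)
print("std ratio (1st / 2nd component):", spread[0] / spread[1])
```

Running the same two checks on the output of an incremental method gives a quick numeric comparison that does not depend on reading the scatter plots.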
By the way, Skocaj_IPCA often triggers the following warning on my machine: RuntimeWarning: invalid value encountered in divide explainedvariance.sum())
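I have not traced where the warning comes from in pyIPCA, so the following is only a guess at the mechanism. In NumPy, this RuntimeWarning typically appears when a division produces NaN, for example 0/0 when the sum of the variances is zero. The sketch below reproduces that case and shows one possible guard; `variance_ratio` is a hypothetical helper, not a function from pyIPCA.

```python
import warnings

import numpy as np


def variance_ratio(explained_variance):
    """Normalise variances to ratios; return zeros if the total is zero or not finite."""
    explained_variance = np.asarray(explained_variance, dtype=float)
    total = explained_variance.sum()
    if total == 0 or not np.isfinite(total):
        return np.zeros_like(explained_variance)
    return explained_variance / total


# Reproduce the warning: dividing zeros by a zero total gives NaN.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    zeros = np.zeros(2)
    unguarded = zeros / zeros.sum()

print("unguarded result:", unguarded)
print("warnings raised:", [str(w.message) for w in caught])
print("guarded result:", variance_ratio(zeros))
```

If the variances contain NaN rather than zeros, the guard above also returns zeros, which hides the problem instead of fixing it; in that case the cause would have to be found earlier in the update step.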
Many thanks for your contributions to sklearn.
Rex