RandomizedPCA before ICA

dengemann commented 12 years ago

Hi folks,

I would like to include RandomizedPCA from sklearn as a step before the ICA. This would have at least two obvious advantages. First, it would speed up the decomposition. Second, we could pass an explained variance criterion instead of n_components, which I guess is more systematic / controlled than deciding the number of components heuristically, at best. The rPCA would then do the whitening. One issue I see, although I guess a minor one, is that we would have to change the API slightly due to the different init requirements: for the rPCA we need to specify the number of components in advance. If we want to use the rPCA to inform our choice of n_components on init of ICA, this won't work. I see two options. A) Moving the picks arg from the ICA.decompose_XXX methods to ICA.init --> len(picks) will tell rPCA how many components there are, rPCA will tell ICA the n_components, and the n_components arg can pass, if it is a float between 0 and 1, the explained variance selection criterion. B) Putting rPCA inside the ICA.decompose_XXX methods and doing basically the same.

I have a preference for A) because the entire ICA workflow is fixed to the channel structure passed on decomposition. As far as I can see, nothing would be lost with this move. In fact the code would become more compact. B) could do it as well, but the rPCA would then appear in two methods, and I think it would take effort to keep the parameters and the interface consistent.

What do you think?

Denis

dengemann commented 12 years ago

... On closer inspection things are somewhat more complicated: we need both the ICA / PCA and the data together in one call. The two new options then are A) pass data containers, that is, raw or epochs objects, on init of ICA, or B) do the PCA / ICA initialization when calling ica.decompose_epochs/raw. I would like A), because the structure chosen for decomposition will determine the session bound to the ICA instance either way, so why not commit to this choice on init? Choice B) would make the init superfluous, maybe just setting up some very general settings (but which ones?); everything needed for initializing would flow through decompose_XXX. On the other hand, this option would be more consistent with the direction pursued so far.

agramfort commented 12 years ago

the way I see it is to do the whitening in python and call FastICA with whiten=False
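A minimal sketch of what "whitening in python" could look like, using plain NumPy rather than RandomizedPCA. The function name `pca_whiten` and the toy data are assumptions made for illustration, not code from mne-python; the exact SVD here only stands in for the randomized solver.

```python
import numpy as np


def pca_whiten(data, n_components):
    """Whiten `data` (n_samples, n_features) with PCA, keeping n_components.

    Hypothetical helper: returns the whitened data and the explained
    variance of each kept component.
    """
    data = data - data.mean(axis=0)
    n_samples = data.shape[0]
    # Economy-size SVD of the centered data: data = u * s * vt
    u, s, vt = np.linalg.svd(data, full_matrices=False)
    explained_variance = (s ** 2) / (n_samples - 1)
    # Projecting onto the principal axes and dividing by each component's
    # standard deviation equals rescaling the left singular vectors.
    whitened = u[:, :n_components] * np.sqrt(n_samples - 1)
    return whitened, explained_variance[:n_components]


# Toy data: 5 mixed non-Gaussian sources, 1000 samples (illustrative only)
rng = np.random.RandomState(0)
mixing = rng.randn(5, 5)
data = rng.laplace(size=(1000, 5)) @ mixing.T

whitened, variance = pca_whiten(data, n_components=3)
covariance = np.cov(whitened, rowvar=False)
```

The whitened array has (numerically) identity covariance, so it is the kind of input one would hand to FastICA with whiten=False.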

to take care of the number of dimensions / components as a fraction of the variance I would pass a maximum number of components (max_n_components) and then select only the first ones or raise an error if max_n_components was too small.
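The max_n_components idea could be sketched roughly as follows. Everything here is an assumption for illustration: the helper name `select_n_components`, the rule "keep the smallest number of components whose cumulative explained variance ratio reaches the requested fraction", and the exact SVD used as a stand-in for a truncated RandomizedPCA fit.

```python
import numpy as np


def select_n_components(data, max_n_components, fraction):
    """Return how many of the first max_n_components to keep.

    Hypothetical helper. `fraction` is the share of total variance
    (0 < fraction < 1) that the kept components should explain.
    """
    data = data - data.mean(axis=0)
    n_samples = data.shape[0]
    # The total variance is known without any decomposition.
    total_variance = data.var(axis=0, ddof=1).sum()
    # Stand-in for a truncated PCA: only the first max_n_components
    # singular values are looked at.
    s = np.linalg.svd(data, compute_uv=False)[:max_n_components]
    ratio = (s ** 2) / (n_samples - 1) / total_variance
    cumulative = np.cumsum(ratio)
    if cumulative[-1] < fraction:
        raise ValueError(
            "max_n_components=%d explains only %.3f of the variance; "
            "increase it to reach %.3f"
            % (max_n_components, cumulative[-1], fraction)
        )
    # Index of the first component at which the fraction is reached, plus one
    return int(np.searchsorted(cumulative, fraction) + 1)


# Toy data whose columns have very different variances (illustrative only)
rng = np.random.RandomState(42)
data = rng.randn(1000, 5) * np.array([10.0, 5.0, 1.0, 0.1, 0.1])

n_kept = select_n_components(data, max_n_components=3, fraction=0.9)
```

Asking for too few components, for example max_n_components=1 with fraction=0.9 on this data, raises the ValueError instead of silently returning a short decomposition.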

dengemann commented 12 years ago

On 12.11.2012, at 21:28, Alexandre Gramfort notifications@github.com wrote:

the way I see it is to do the whitening in python and call FastICA with whiten=False

exactly. as I see it, to do so, we need to supply data, n_components / picks in one function / method call, i.e., init or decompose.

to take care of the number of dimensions / components as a fraction of the variance I would pass a maximum number of components (max_n_components) and then select only the first ones or raise an error if max_n_components was too small.

hmm. seems somewhat unintuitive to me. wouldn't a direct float arg be simpler? not sure whether i got the point. i'd suggest opening a first PR to avoid linguistic ambiguities :)

D


agramfort commented 12 years ago

exactly. as I see it, to do so, we need to supply data, n_components / picks in one function / method call, i.e., init or decompose.

yes

to take care of the number of dimensions / components as a fraction of the variance I would pass a maximum number of components (max_n_components) and then select only the first ones or raise an error if max_n_components was too small.

hmm. seems somewhat unintuitive to me. wouldn't a direct float arg be simpler? not sure whether i got the point. i'd suggest opening a first PR to avoid linguistic ambiguities :)

the problem is that you don't know in advance the number of components to ask RandomizedPCA for, and I am not sure how to compute it incrementally. So my approach would be to ask for more and only keep the first ones according to a float value

makes sense?

dengemann commented 12 years ago

On Tue, Nov 13, 2012 at 1:10 PM, Alexandre Gramfort < notifications@github.com> wrote:

exactly. as I see it, to do so, we need to supply data, n_components / picks in one function / method call, i.e., init or decompose.

yes

.. which implies some restructuring.

to take care of the number of dimensions / components as a fraction of the variance I would pass a maximum number of components (max_n_components) and then select only the first ones or raise an error if max_n_components was too small.

hmm. seems somewhat unintuitive to me. wouldn't a direct float arg be simpler? not sure whether i got the point. i'd suggest opening a first PR to avoid linguistic ambiguities :)

the problem is that you don't know in advance the number of components to ask RandomizedPCA for, and I am not sure how to compute it incrementally. So my approach would be to ask for more and only keep the first ones according to a float value

makes sense?

Yes, now I got you --- consensus alarm ;-) --- my naive approach was just to pass n_components = len(picks) to the PCA, whereas the actual n_components arg is then used to take the first n components (if integer) or (if float) the components for which the cumsum of explained variance is smaller than k, with 0.0 < k < 1.0. So does it mean that just passing len(picks) as n_components for pca might fail?
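The int / float reading of n_components described above might look something like this. The helper name `resolve_n_components` is hypothetical, and the float rule follows the wording here literally (count the components whose cumulative ratio is still below k); other conventions are possible.

```python
import numpy as np


def resolve_n_components(n_components, explained_variance_ratio):
    """Turn an int or float n_components into a number of components.

    Hypothetical helper: an int keeps the first n components, a float k
    with 0 < k < 1 keeps the components whose cumulative explained
    variance ratio is smaller than k.
    """
    ratio = np.asarray(explained_variance_ratio, dtype=float)
    if isinstance(n_components, float):
        if not 0.0 < n_components < 1.0:
            raise ValueError("a float n_components must lie in (0, 1)")
        cumulative = np.cumsum(ratio)
        return int(np.sum(cumulative < n_components))
    # Integer case: never return more components than the PCA provided
    return min(int(n_components), len(ratio))


# Made-up ratios, as a PCA fitted with n_components = len(picks) might report
ratios = [0.5, 0.3, 0.15, 0.05]

n_from_int = resolve_n_components(3, ratios)
n_from_float = resolve_n_components(0.9, ratios)
```

With these ratios the cumulative sums are 0.5, 0.8, 0.95 and 1.0, so k = 0.9 keeps two components. Note that a very small k can yield zero components under this literal rule, which a real implementation would probably want to guard against.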

Let me see.


agramfort commented 12 years ago

So does it mean that just passing len(picks) as n_components for pca might fail?

no but it will be slower and more memory consuming

dengemann commented 12 years ago

I see... So max_n_components is used to reach k explained variance, which in turn gives n_components... I'll have to play around with it.

On Tue, Nov 13, 2012 at 2:02 PM, Alexandre Gramfort < notifications@github.com> wrote:

So does it mean that just passing len(picks) as n_components for pca might fail?

no but it will be slower and more memory consuming


dengemann commented 12 years ago

Thanks for the discussion, let's move it to the new related PR. Closing this.

mne-tools / mne-python

RandomizedPCA before ICA #184