Closed cbrnr closed 8 years ago
What are the differences? Is it the API?
Hi, I see two main differences:
I believe the SCoT implementation follows the Blankertz paper about CSP, while MNE follows the one by Koles.
Exactly, these are the main differences. In addition, there is the question of how to normalize the CSP filters (i.e. the eigenvectors). There is no single right solution, but some solutions might make more sense than others (e.g. normalizing to unit length, or normalizing in such a way that the resulting features are standardized).
Finally, the shape of the input data differs from SCoT. In MNE, it is (trials, channels, samples), whereas in SCoT the first and last axes are swapped (samples, channels, trials). It seems like only the former works with scikit-learn estimators (why?), so SCoT should probably adopt this shape (see #63 for this issue).
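To illustrate the axis difference, here is a minimal sketch (the shapes and variable names are made up for illustration) that converts SCoT-style data to MNE-style data by reversing the trial and sample axes:

```python
import numpy as np

# Hypothetical data: SCoT-style shape (samples, channels, trials).
rng = np.random.default_rng(0)
scot_data = rng.standard_normal((1000, 32, 40))  # (samples, channels, trials)

# MNE-style shape (trials, channels, samples): swap the first and last axes.
mne_data = np.transpose(scot_data, (2, 1, 0))
print(mne_data.shape)  # (40, 32, 1000)
```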
Regarding the two main differences, the covariance estimation methods should be identical (except for a small factor which doesn't really change anything). However, the method adopted by SCoT might work better with regularization, because many smaller matrices will be shrunk more as compared to one covariance matrix computed from the concatenated trials. I'm not sure, but this could be tested. We could have both methods and let the user choose though.
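To make the two covariance estimation methods concrete, here is a rough sketch of both (function names and shapes are illustrative, not the actual SCoT or MNE code). With shrinkage, the per-trial variant would regularize each small matrix individually, which is the effect discussed above:

```python
import numpy as np

def cov_concatenated(x):
    # One covariance matrix from all trials concatenated along time
    # (MNE-style). x has shape (trials, channels, samples).
    n_trials, n_channels, n_samples = x.shape
    x2d = x.transpose(1, 0, 2).reshape(n_channels, -1)
    x2d = x2d - x2d.mean(axis=1, keepdims=True)
    return x2d @ x2d.T / x2d.shape[1]

def cov_averaged(x):
    # Average of per-trial covariance matrices (SCoT-style).
    covs = []
    for trial in x:
        trial = trial - trial.mean(axis=1, keepdims=True)
        covs.append(trial @ trial.T / trial.shape[1])
    return np.mean(covs, axis=0)
```

For (near) zero-mean trials the two estimates agree up to a small factor; they only diverge once per-matrix regularization enters the picture.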
The estimated spatial filters are identical. As @alexandrebarachant says, MNE uses the step-by-step approach described by Koles, whereas SCoT takes a shortcut by directly solving the generalized eigenvalue problem (S1, S1 + S2). I think the latter is shorter and probably easier to read.
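The one-step shortcut boils down to a single `scipy.linalg.eigh` call. A minimal sketch with made-up covariance matrices (variable names are illustrative):

```python
import numpy as np
from scipy.linalg import eigh

# Illustrative class covariances from random data.
rng = np.random.default_rng(42)
s1 = np.cov(rng.standard_normal((8, 200)))  # class-1 covariance
s2 = np.cov(rng.standard_normal((8, 200)))  # class-2 covariance

# Generalized eigenvalue problem (S1, S1 + S2): eigenvalues come back
# in ascending order and all lie in [0, 1].
evals, evecs = eigh(s1, s1 + s2)

# Filters at both ends of the spectrum capture maximal variance for one
# class and minimal variance for the other; keep 2 per end here.
w = np.hstack([evecs[:, :2], evecs[:, -2:]])
```

Note that scipy normalizes the eigenvectors so that `evecs.T @ (s1 + s2) @ evecs` is the identity, which is already one of the possible filter normalizations mentioned above.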
Indeed, I generally prefer trial-by-trial covariance estimation because I like to use other metrics to average the covariance matrices (e.g. the Riemannian geometric mean is generally more robust).
About the shape of the input data, this is to be compatible with the sklearn API. sklearn expects trials on the first dimension, so we have to keep it that way.
MNE's order of dimensions (trials, channels, samples) feels more natural, and it is nicer to the cache when looping over trials or bootstrapping by trial. However, keep in mind that this is actually inconsistent with sklearn, which uses (n_samples, n_features). In the case of CSP, the 'features' are channels.
Still, I prefer MNE's order, and in the long term SCoT should switch to it too.
It might be worthwhile to consider factoring out the covariance estimation, so we could plug different covariance estimators into the CSP transform.
Is there any information available that compares the step-by-step approach to the generalized eigenvalue decomposition in terms of performance and numerical stability?
I'm not sure if this is inconsistent with sklearn - if we compute one value per trial, then this actually corresponds to a sample. This is why the zeroth dimension must be equal to the number of trials - otherwise, the length of the labels vector y is not compatible.
The inconsistent thing is the continuous EEG - MNE puts it into (n_chans, n_samples). You could argue that this doesn't correspond to sklearn's (n_samples, n_features) format.
Factoring out covariance estimation was already done in MNE, so yes, we should do that!
I once gave a lecture on CSP, and I dug out Fukunaga, a statistics textbook often cited in the context of CSP (simultaneous diagonalization). AFAIK, the single steps are totally equivalent to the generalized eigenvalue decomposition. Numerically, I'm not sure, but I guess splitting the computation into many steps might propagate numerical errors more than a single step.
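The equivalence is easy to check numerically. The following sketch (with made-up data) compares the one-step generalized eigendecomposition against a Koles-style step-by-step route (whiten S1 + S2, then diagonalize the whitened S1); both yield the same eigenvalues, and the filters agree up to sign:

```python
import numpy as np
from scipy.linalg import eigh

# Illustrative class covariances.
rng = np.random.default_rng(1)
s1 = np.cov(rng.standard_normal((6, 500)))
s2 = np.cov(rng.standard_normal((6, 500)))

# One step: generalized eigenvalue problem.
d1, v1 = eigh(s1, s1 + s2)

# Step by step: whiten S1 + S2, then diagonalize the whitened S1.
evals, evecs = eigh(s1 + s2)
whitener = evecs @ np.diag(evals ** -0.5) @ evecs.T
d2, u = eigh(whitener @ s1 @ whitener)
v2 = whitener @ u

print(np.allclose(d1, d2))  # True: both routes agree
```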
Let's leave it at that 3D input can never be 100% consistent with sklearn's 2D input, no matter how we arrange the dimensions. It's more important to be consistent with MNE.
Using the generalized eigendecomposition from scipy has the advantage that someone else is responsible for the implementation, and if they improve it in the future we get the update for free :) (Well, if they break it, we choke on that too...)
Any thoughts on feature normalization?
OK, making the axes compatible with MNE sounds good. To recap, this means the input data will have shape (trials, channels, samples).
If you agree, I'll put this into #63 and we'll continue over there.
Regarding eigendecomposition, I prefer the one-step approach.
Regarding feature normalization, I think this is important: in our implementation, the features got so small (around 1e-12) that most classifiers didn't work anymore. Therefore, as @kazemakase suggested elsewhere, I would normalize the filters so that the resulting features are standardized. If this is too complicated (or contrived), I would make the filters (eigenvectors) unit length and then perform feature standardization in the transform method.
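One way to standardize via the filters themselves is to rescale each filter so that the filtered signal has unit variance under a pooled training covariance; the (pre-log) variance features are then standardized by construction. A rough sketch (the function name and API are my invention, not SCoT's actual code):

```python
import numpy as np

def standardize_filters(w, pooled_cov):
    # w: spatial filters, shape (channels, n_filters).
    # Rescale each filter so that w_j^T C w_j == 1 under the pooled
    # covariance C, i.e. each filtered signal has unit variance.
    var = np.einsum("ij,jk,ki->i", w.T, pooled_cov, w)  # w_j^T C w_j
    return w / np.sqrt(var)
```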
Thanks for clarifying.
If some computational/robustness improvements can be made to the CSP code don't hesitate to submit a PR.
I'm closing this since all relevant changes have been implemented in MNE.
I just wrote a CSP class which can be used as a scikit-learn estimator. I think it would be nice if we replaced the existing implementation with the new one, because that way CSP can be used in a broader context.
I've based my implementation on the structure of MNE's CSP class, but our actual CSP algorithm is a bit different from theirs. Maybe it would make sense to join forces here and develop a unified CSP implementation? CC @agramfort @alexandrebarachant: if you're interested, let me know and we can discuss the differences between our implementations.
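For context, a minimal two-class CSP estimator following the scikit-learn API could look roughly like this. This is only a sketch of the general idea (the class name, component selection, and log-variance features are my assumptions), not the actual SCoT or MNE code:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.base import BaseEstimator, TransformerMixin

class CSP(BaseEstimator, TransformerMixin):
    """Minimal two-class CSP sketch (illustrative only)."""

    def __init__(self, n_components=4):
        self.n_components = n_components

    def fit(self, X, y):
        # X: (trials, channels, samples), y: binary labels.
        covs = []
        for c in np.unique(y):
            Xc = X[y == c]
            trials = Xc - Xc.mean(axis=2, keepdims=True)
            covs.append(np.mean([t @ t.T / t.shape[1] for t in trials],
                                axis=0))
        s1, s2 = covs
        _, evecs = eigh(s1, s1 + s2)
        # Keep filters from both ends of the eigenvalue spectrum.
        k = self.n_components // 2
        self.filters_ = np.hstack([evecs[:, :k], evecs[:, -k:]])
        return self

    def transform(self, X):
        # Log-variance features: one value per filter per trial.
        filtered = np.einsum("ck,tcs->tks", self.filters_, X)
        return np.log(filtered.var(axis=2))
```

Because it implements `fit` and `transform` with (trials, channels, samples) input, it can be dropped into an sklearn `Pipeline` in front of a classifier, which is exactly the "broader context" motivation above.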