Implement rank deficiency check in vector_auto_regression

witherscp commented 1 year ago

Describe the problem

The current version of vector_auto_regression has an option to manually input l2_reg ridge penalty, but it does not check whether the matrix is rank deficient, in the case that the user chooses the default l2_reg of 0 (i.e., no regularization). This leads to a problem where the outputs are useless if the user does not realize that their matrix is rank deficient.

Describe your solution

I want to check for the matrix rank to ensure that it matches the number of nodes across all epochs. If any epoch is rank deficient, then for every epoch sklearn.linear_model.RidgeCV will be employed to automatically search for optimal alpha value, which will be used to regularize the matrix. This is more useful than asking the user to input an l2_reg because RidgeCV can solve for the best value using cross-validation.

Describe possible alternatives

A possible alternative would be to apply Ridge regression to all input matrices, regardless of whether they are full rank. Another option would be to remove the l2_reg value and use the automated search for all matrices if the user chooses.

Additional context

This issue was created based on this discussion post.

I am going to work on this and submit a pull request with my progress soon!

adam2392 commented 1 year ago

Hi @witherscp thanks for opening up an issue!

Checking matrix rank would not be sufficient as there is a linear algebra result "almost every matrix is diagonalizable" (i.e. full rank) because of numerical precision. A better value to check would be the condition number (using numpy/scipy).

Re cross-validation step: I am okay w/ adding this in, but a bunch of open questions that are not that simple. However, what would the score be measured on? Is it how well it fits the training data?, or how well it fits the "next epoch", or "the next n time steps"?

witherscp commented 1 year ago

If I check condition number using np.linalg.cond(), how can I know whether it would be considered too large? In my sample data, when there is no common average reference the condition values are around ~1000 for each epoch, whereas they are 10^15 when the common average reference is applied which is ill-conditioned.

In terms of the cross-validation, I was planning on using 5-fold cross validation with: reg = RidgeCV(alphas=np.logspace(-15,5,21), cv=5).fit(z, y_sample) within the function _estimate_var. In this case, I believe the epoch is divided into 5 groups, with each one being left out and the fit determined separately. The score would be based on how well it fits the training data within the epoch. Does that make sense to use? Empirically, in the data I have tried it on, the alpha value tends to be 1e-8 or 1e-7 and the R^2 score within an epoch for the optimal alpha is around 0.4-0.7.

adam2392 commented 1 year ago

If I check condition number using np.linalg.cond(), how can I know whether it would be considered too large? In my sample data, when there is no common average reference the condition values are around ~1000 for each epoch, whereas they are 10^15 when the common average reference is applied which is ill-conditioned.

We could just come up with a sensible default (e.g. 1e6 or 1e7)?

In terms of the cross-validation, I was planning on using 5-fold cross validation with: reg = RidgeCV(alphas=np.logspace(-15,5,21), cv=5).fit(z, y_sample) within the function _estimate_var. In this case, I believe the epoch is divided into 5 groups, with each one being left out and the fit determined separately. The score would be based on how well it fits the training data within the epoch. Does that make sense to use? Empirically, in the data I have tried it on, the alpha value tends to be 1e-8 or 1e-7 and the R^2 score within an epoch for the optimal alpha is around 0.4-0.7.

Sure I think this makes sense. But I suspect might need to split the epochs such that they predict some validation epoch (might be shorter than the window lengths). But maybe let's try it out in your PR and if you can show a good example on simulated (or real data) that would be good to inform the settings.

To make ill conditioned data in simulation you can just add redundant sensors with small noise or something.

Feel free to open the PR early to get some initial feedback.

witherscp commented 1 year ago

Submitted the PR. The dynamic model seems to work well. The residuals are quite small, even less than the OLS. I'm not sure if the average epochs model works as well because the optimal alpha value is the absolute minimum of the np.logspace I provide and if I give a bigger range then the alpha value continues to drop but there is an ill-conditioned solving error. So far, I have only tested with the ECOG dataset using a common average reference.

mne-tools / mne-connectivity