Open witherscp opened 1 year ago
I agree that using only the l2_reg argument will make things much simpler. Do you want me to implement this and then commit the changes?
That would be great and then I can help review the code!
I opted to keep the underlying functions unchanged w/r/t the l2_reg parameter, but I had to overwrite the l2_reg value at line 185 to avoid problems. This isn't clean code-wise, but it won't affect the user experience. I also thought it was helpful to include the cross-validation alpha values in the displayed model parameters, so that the user can view them when verbose is set to True.
I've never worked with unit tests before. What would be a good unit test to add for the new features, and how do I add it? As for the current tests that are failing, I am getting the same problem across 10 tests: numpy.core._exceptions._UFuncNoLoopError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U4'), dtype('float64')) -> None. I'm not sure of the cause of this error, but it may be related to the new default l2_reg value, which is a str instead of a float.
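For what it's worth, that ufunc error is consistent with the str default leaking into a numeric expression. A minimal repro (assuming nothing about the package code itself):

```python
import numpy as np

# If the default l2_reg='auto' (a 4-character str, hence dtype '<U4') reaches
# an arithmetic expression, NumPy has no multiply loop for <U4 * float64:
try:
    np.array("auto") * np.float64(1.0)
except TypeError as e:  # NumPy's ufunc type errors subclass TypeError
    print(type(e).__name__)
```

So the fix is to convert 'auto' into a numeric penalty before any arithmetic happens.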
Re adding unit tests: you would add a function whose name starts with test_ to the package's test suite. Examples are in https://github.com/mne-tools/mne-connectivity/blob/main/mne_connectivity/vector_ar/tests/test_var.py. The tests are run with pytest, which is used heavily in all fields. You might find it helpful/educational to peruse their docs: https://docs.pytest.org/en/7.2.x/. It's a good tool for helping you develop robust scientific software. Lmk if you have any questions!
Re the errors you see, I think you are right. The issue is that l2_reg = 'auto' should be converted into a number. IIUC there are two cases for l2_reg == 'auto':

- the data are ill-conditioned: CV is run to choose the l2_reg value
- the data are well-conditioned: we set l2_reg = 0

So in summary:

- l2_reg = None/0: no CV is done regardless of condition number. Maybe a warning is released.
- l2_reg = 'auto' (default now): CV is done only if the condition number is high.
- l2_reg = array of floats: CV is done using that array of floats regardless of condition number.
- l2_reg = number > 0: no CV is done, and the l2_reg is applied in all cases.

Does this seem right to you? If so, perhaps you can walk through the code and see where there might be issues? Once you write the unit-tests, those will also help in debugging the behavior in the different cases.
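To make the dispatch concrete, here is a minimal sketch of the four cases; the helper name resolve_l2_reg, the condition-number threshold, and the alpha grid are all assumptions for illustration, not the actual mne-connectivity implementation:

```python
import warnings

import numpy as np
from sklearn.linear_model import RidgeCV

def resolve_l2_reg(l2_reg, X, y, cond_thresh=1e6):
    """Hypothetical sketch: map the l2_reg argument to a numeric penalty."""
    cond = np.linalg.cond(X)
    # 'auto' (default): CV only if the condition number is high
    if isinstance(l2_reg, str) and l2_reg == "auto":
        if cond <= cond_thresh:
            return 0.0
        return RidgeCV(alphas=np.logspace(-3, 3, 7)).fit(X, y).alpha_
    # None/0: no CV regardless of condition number (maybe warn)
    if l2_reg is None or (np.isscalar(l2_reg) and l2_reg == 0):
        if cond > cond_thresh:
            warnings.warn("Data are ill-conditioned; consider l2_reg='auto'.")
        return 0.0
    # array of floats: always CV over those alphas
    if isinstance(l2_reg, (list, tuple, np.ndarray)):
        return RidgeCV(alphas=np.asarray(l2_reg, float)).fit(X, y).alpha_
    # positive number: apply it directly, no CV
    return float(l2_reg)
```

Note the str check comes first: np.isscalar('auto') is True, so testing l2_reg == 0 before handling the string would be fragile.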
Thanks for the debugging help in the last comment! I was able to fix the problem where I did not account for the case when l2_reg='auto' and data
is not ill-conditioned.
I'm now having some trouble making the unit-tests. I noticed how you have a correct A matrix in your create_noisy_data()
function which you can use to test whether the VAR-created A matrix is correct. I can create an ill-conditioned matrix using the below code, but how do I know the actual A matrix for this dataset?
import numpy as np

rng = np.random.RandomState(0)
n_epochs, n_signals, n_times = 2, 4, 64
data = rng.randn(n_epochs, n_signals, n_times)
times = np.arange(n_times)
# duplicate the last channel, then perturb it slightly so the data become ill-conditioned
final_row = np.reshape(data[:, -1, :], (n_epochs, 1, n_times))
illconditioned = np.concatenate((data, final_row), axis=1)
illconditioned[:, -1, :] += rng.normal(0, 1e-6, n_epochs * n_times).reshape(n_epochs, -1)
Right now, my unit test would simply iterate through a set of parameters for l2_reg -- 'auto', None, 0, 0.3, np.array([0.1, 1]), [0.1, 1] -- to make sure that no errors are encountered. But how can I know if the result is "correct"?
Hmm good question. Maybe you can try the following:
Lmk if steps 1-4 end up giving you at least what you need in terms of the test dataset, or if you have any coding questions.
We'll want to test that this at least works on some simulated data. You also want to create a set of tests to verify that the correct behavior is carried out in each case of l2_reg. You can verify, at the very minimum, that the estimated matrix is "close" to the ground truth and/or that the model params passed to the connectivity data structure are what you might expect.
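A parametrized pytest sketch of that idea, with a plain OLS stand-in (fit_var_ols is hypothetical; the real test would call vector_auto_regression and forward l2_reg to it):

```python
import numpy as np
import pytest

def fit_var_ols(data):
    """Hypothetical stand-in for the model under test: one-lag OLS VAR fit."""
    X, Y = data[:, :-1].T, data[:, 1:].T  # regress x[t+1] on x[t]
    return np.linalg.lstsq(X, Y, rcond=None)[0].T

@pytest.mark.parametrize("l2_reg", ["auto", None, 0, 0.3, [0.1, 1.0]])
def test_var_recovers_A(l2_reg):
    rng = np.random.RandomState(0)
    A = np.array([[0.9, 0.1], [0.0, 0.5]])  # known ground-truth VAR(1) matrix
    x = np.zeros((2, 500))
    x[:, 0] = rng.randn(2)
    for t in range(499):
        x[:, t + 1] = A @ x[:, t] + 0.1 * rng.randn(2)
    A_hat = fit_var_ols(x)  # real call would pass l2_reg through here
    # minimal "correctness": the estimate is close to the ground truth
    assert np.linalg.norm(A_hat - A) < 0.3
```

Parametrizing over l2_reg gives you the "no errors in any case" coverage, and the norm check gives a correctness floor even before you test case-specific behavior.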
Unrelated note: I saw you work with Sara Inati? She was a collaborator when I was in my PhD. Say hi :).
OK, I tried my best to follow your suggestions. I only needed to set one eigenvalue to 1e-6 to obtain an ill-conditioned matrix (as long as there are 12 channels in the dataset), so I did not need to set 2/3 of the starting conditions to the same value as well. Using the ill-conditioned data matrix, I was able to test out the l2_reg parameter with and without regularization.
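One way to get a ground-truth A with a tiny eigenvalue, as described above, is a similarity transform; this is a sketch under my own assumptions (eigenvalue range, channel count, noise-free simulation), not necessarily the exact construction used:

```python
import numpy as np

rng = np.random.RandomState(0)
n_signals, n_times = 12, 64  # 12 channels, as in the comment above

# Ground-truth A with known eigenvalues, one of them tiny (1e-6):
# A = Q diag(evals) Q^{-1}, with Q a random invertible mixing matrix.
evals = rng.uniform(0.2, 0.9, n_signals)
evals[0] = 1e-6
Q = rng.randn(n_signals, n_signals)
A = Q @ np.diag(evals) @ np.linalg.inv(Q)

# Noise-free simulation from random starting conditions: x_t = A^t x_0,
# so the true A is known exactly; the geometric decay makes later columns
# nearly dependent on earlier ones, i.e. the data matrix is ill-conditioned.
x = np.empty((n_signals, n_times))
x[:, 0] = rng.randn(n_signals)
for t in range(n_times - 1):
    x[:, t + 1] = A @ x[:, t]

print(np.linalg.cond(x))  # should come out large
```

Because A is constructed explicitly, the test can compare the estimated matrix (and its eigenvalues) against it directly.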
Now the problem is that RidgeCV does not seem to outperform OLS in this sample dataset. In addition, many of the eigenvalues are dissimilar from the actual sample eigenvalues. What is your suggestion for next steps in debugging?
And yes, I do work with Sara Inati! Glad to see you have collaborated with us in the past. I'm thankful for your help on getting this improvement finished.
Hmm I'll list some things to try systematically. Before you do so, it might be helpful to track the performance changes relative to your current attempt. The two metrics I have in mind are i) how closely you approximate the true A (so some type of matrix norm of the difference), and ii) how closely you approximate the set of true eigenvalues. Then you can see if any of these changes are at least moving in the right direction (i.e. RidgeCV better than OLS).
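The two metrics could be computed with a small helper like this (fit_metrics is a hypothetical name for illustration):

```python
import numpy as np

def fit_metrics(A_true, A_hat):
    """Sketch of the two tracking metrics suggested above:
    i) Frobenius norm of the estimation error,
    ii) distance between the sorted true and estimated eigenvalue sets."""
    a_err = np.linalg.norm(A_true - A_hat, ord="fro")
    ev_true = np.sort_complex(np.linalg.eigvals(A_true))
    ev_hat = np.sort_complex(np.linalg.eigvals(A_hat))
    ev_err = np.linalg.norm(ev_true - ev_hat)
    return a_err, ev_err
```

Logging both numbers for OLS and RidgeCV on the same dataset makes it easy to see whether a change moves either method in the right direction.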
- Try a non-upper-triangular A with multiple small eigenvalues. Intuitively, this will allow the data to mix more; perhaps rn the problem is so easy that OLS is just the better thing to do. Remember that A is mixing x0 (the initial conditions).
- Try different combinations of n_sensors and n_timepoints. My suspicion is that n_sensors plays a greater role.

Lmk if that helps, or if you are stuck anywhere.
Hey @witherscp just checking in on this to see if you need any assistance?
If there is trouble figuring out a rigorous scenario where this is tested, we can always disable this by default and instead you can add a short example on some simulated, or real data (see existing examples), where this shows "improvement".
Hey @witherscp just checking in on this to see if you need any assistance?
If there is trouble figuring out a rigorous scenario where this is tested, we can always disable this by default and instead you can add a short example on some simulated, or real data (see existing examples), where this shows "improvement".
Hey, sorry for the delay; I got sidetracked by other tasks that I have been working on. I will try to test out a few scenarios where RidgeCV outperforms OLS before opting to disable the default. If this proves to be too difficult, though, it will be easy to show an example where it is necessary (as in the case of common average reference).
Can you provide a little more help with the example you suggested of using a non-upper-triangular matrix? I am not that well-versed in linear algebra, so I am starting to get confused by some of these ideas. What lines of code would create a non-upper-triangular matrix with multiple small eigenvalues?
PR Description
Closes #123
Adds the auto_reg parameter to vector_auto_regression. This regularizes the input matrix using scikit-learn's RidgeCV, which searches for the optimal alpha value. The auto_reg parameter is automatically set to True for ill-conditioned matrices.
Merge checklist
Maintainer, please confirm the following before merging: