mne-tools / mne-connectivity

Connectivity algorithms that leverage the MNE-Python API.
https://mne.tools/mne-connectivity/dev/index.html
BSD 3-Clause "New" or "Revised" License

[ENH] Add support for ragged connections with multivariate methods with padding #142

Closed tsbinns closed 8 months ago

tsbinns commented 11 months ago

Following discussions in #125.

I have added support for handling 'ragged' connections (i.e. where the number of seeds and targets differs across and within connections) with the multivariate methods MIC, MIM, and state-space GC introduced in #138. This affects the indices attribute and the patterns attribute (associated with MIC, which returns values for each channel in the seeds and targets). The connectivity results themselves are unaffected: thanks to the built-in dimensionality reduction, a single connectivity spectrum is returned regardless of the number of seeds and targets. I have written an example which explains the problem and the workaround in more detail, as @adam2392 suggested: https://github.com/mne-tools/mne-connectivity/blob/73792c3ee6c4a2cf2b2d963f47701481b5a19f39/examples/handling_ragged_arrays.py#L2-L7

In short: ragged arrays are difficult to work with and cannot be saved using the h5netcdf engine. To get around this, we can pad the missing entries with a known value to make the arrays 'full'.

Padding indices

For padding indices, I chose the value -1. The entries need to be integers to be usable for indexing, so invalid markers such as np.nan or np.inf (which are floats) cannot be represented in the arrays; -1 therefore serves as an invalid channel index. For example, say we want to look at two connections, the first between 2 seeds and 3 targets, and the second between 4 seeds and 1 target; indices could have the form:

ragged_seeds = [[0, 1], [5, 6, 7, 8]]
ragged_targets = [[2, 3, 4], [9]]
ragged_indices = (ragged_seeds, ragged_targets)

The ragged seeds/targets thus have shape [cons, **variable**]. This can be passed to the connectivity functions, and will internally be padded to:

padded_seeds = [[0, 1, -1, -1], [5, 6, 7, 8]]
padded_targets = [[2, 3, 4, -1], [9, -1, -1, -1]]
padded_indices = (padded_seeds, padded_targets)

The padded seeds/targets now have shape [cons, max_channels], which can be stored in the connectivity objects. The consistent shape makes them easier to handle and also compatible with saving.
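For illustration, here is a minimal sketch of the padding logic; the helper name pad_ragged_indices is hypothetical and the library's internal implementation may differ:

import numpy as np

def pad_ragged_indices(ragged_indices, pad_val=-1):
    # pad each connection's channel list to the length of the longest one
    seeds, targets = ragged_indices
    max_n = max(len(chs) for chs in (*seeds, *targets))
    def pad(group):
        return np.array([list(chs) + [pad_val] * (max_n - len(chs))
                         for chs in group])
    return pad(seeds), pad(targets)

padded_seeds, padded_targets = pad_ragged_indices(
    ([[0, 1], [5, 6, 7, 8]], [[2, 3, 4], [9]]))
# padded_seeds   -> [[ 0,  1, -1, -1], [ 5,  6,  7,  8]]
# padded_targets -> [[ 2,  3,  4, -1], [ 9, -1, -1, -1]]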

Padding patterns

For padding patterns, I chose the invalid value np.nan. patterns have shape [2, cons, channels, freqs (, times)] (where the 2 represents entries for the seeds and targets, respectively). Accordingly, if the number of channels differs across the seed and target within a connection, or across connections, patterns will be ragged with shape [2, cons, **variable**, freqs (, times)].

By padding the missing entries with np.nan along the channel dimension, we can make the array shape consistent with [2, cons, max_channels, freqs (, times)] (e.g. [2, 2, 4, freqs (, times)] using the indices in the above example). Extracting only those valid entries from the patterns for a given connection is very simple, as I show in the example linked above.
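As a minimal sketch of that extraction (the shapes here are made up for illustration):

import numpy as np

# hypothetical padded patterns: 1 connection, up to 4 channels, 10 freqs
patterns = np.full((2, 1, 4, 10), np.nan)
patterns[0, 0, :2] = np.random.randn(2, 10)  # 2 seed channels
patterns[1, 0, :3] = np.random.randn(3, 10)  # 3 target channels

con_i = 0
seed_patterns = patterns[0, con_i]
# drop the all-NaN channel rows that were only added as padding
valid = seed_patterns[~np.isnan(seed_patterns).all(axis=-1)]
assert valid.shape == (2, 10)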

Other changes

In line with these changes, I updated the relevant unit tests in test_spectral.py, and also edited the multivariate connectivity examples to take advantage of the new support for ragged connections. I also included two new helper functions, check_multivariate_indices and seed_target_multivariate_indices (analogous to the existing functions but for the new multivariate indices format; both come with unit tests).

Overall I think this approach is quite clean: end-users can give ragged indices to a helper function which will do the padding, or pass them directly to the connectivity methods which will handle everything. Extracting the relevant information from padded indices and patterns is also trivial. Finally, the internal implementation is simple, so maintenance should not be problematic.

Please let me know your thoughts! Thomas

tsbinns commented 11 months ago

@adam2392 I updated the documentation for the util functions: check_indices; check_multivariate_indices; seed_target_indices; and seed_target_multivariate_indices. Is this the sort of thing you had in mind?

tsbinns commented 10 months ago

@adam2392 Thanks very much for the detailed feedback; I have made some updates accordingly.

Also, after this is merged, as we discussed in https://github.com/mne-tools/mne-connectivity/pull/125, are you interested in joining the maintenance team and getting core-dev rights to mne-connectivity?

Yes please, I am very happy to help out and make sure this can be maintained!

adam2392 commented 10 months ago

I think this mostly LGTM, but I have a question regarding indices that I didn't spot earlier. The unit test you added in test_spectral.py should maybe be refactored to use pytest.mark.parametrize:

@pytest.mark.parametrize("method", ["coh", "plv", "pli", "wpli", "ciplv", "mic", "mim"])
@pytest.mark.parametrize("indices", [None, 
                                     (np.array([0, 1]), np.array([2, 3])),
                                     (np.array([[0, 1]]), np.array([[2, 3]]))
                                     ])
def test_spectral_connectivity_indices_roundtrip_io(tmp_path, method, indices):
    """Test that indices values and type is maintained after saving.

    If `indices` is None, `indices` in the returned connectivity object should
    be None, otherwise, `indices` should be a tuple. The type of `indices` and
    its values should be retained after saving and reloading.
    """
    rng = np.random.RandomState(0)
    n_epochs, n_chs, n_times, sfreq, f = 5, 4, 2000, 1000.0, 20.0
    data = rng.randn(n_epochs, n_chs, n_times)
    sig = np.sin(2 * np.pi * f * np.arange(1000) / sfreq) * np.hanning(1000)
    data[:, :, 500:1500] += sig
    info = create_info(n_chs, sfreq, "eeg")
    tmin = -1
    epochs = EpochsArray(data, info, tmin=tmin)
    freqs = np.arange(10, 31)
    tmp_file = os.path.join(tmp_path, "foo_mvc.nc")

    con_epochs = spectral_connectivity_epochs(
        epochs, method=method, indices=indices, sfreq=sfreq, fmin=10, fmax=30
    )
    con_time = spectral_connectivity_time(
        epochs, freqs, method=method, indices=indices, sfreq=sfreq
    )

    for con in [con_epochs, con_time]:
        con.save(tmp_file)
        read_con = read_connectivity(tmp_file)
        if indices is not None:
            # check indices of same type (tuples)
            assert isinstance(con.indices, tuple) and isinstance(
                read_con.indices, tuple
            )
            # check indices have same values
            assert np.all(np.array(con.indices) == np.array(read_con.indices))
        else:
            assert con.indices is None and read_con.indices is None

However, this test fails.

Discussion on why the unit test fails

We have indices defined in two ways, which to me is not easy to remember:

bivar_indices = (np.array([0, 1]), np.array([2, 3]))
multivar_indices = (np.array([[0, 1]]), np.array([[2, 3]]))

I would expect that if someone forgets to pass in a 2D array, they accidentally end up with bivariate indices. I'm just wondering if it's possible to infer this behavior in any way? If not, I think the docstring could maybe be a bit clearer on what indices to expect, i.e. perhaps tuple of 1D arrays, tuple of 2D arrays | None? WDYT?

indices : tuple of array | None
        Two arrays with indices of connections for which to compute
        connectivity. If a multivariate method is called, each array for the
        seeds and targets should contain a nested array of channel indices for
        the individual connections. If None, connections between all channels
        are computed, unless a Granger causality method is called, in which
        case an error is raised.

So IIUC, mic/mim will not work when bivariate indices are passed. And bivariate methods will not work when multivariate indices are passed. However, the current error messages are rather uninformative for bivariate methods:

____________________ test_spectral_connectivity_indices_roundtrip_io[indices2-wpli] ____________________
mne_connectivity/spectral/tests/test_spectral.py:1724: in test_spectral_connectivity_indices_roundtrip_io
    con_epochs = spectral_connectivity_epochs(
<decorator-gen-623>:12: in spectral_connectivity_epochs
    ???
mne_connectivity/spectral/epochs.py:1996: in spectral_connectivity_epochs
    _epoch_spectral_connectivity(data=this_epoch, **call_params)
mne_connectivity/spectral/epochs.py:1418: in _epoch_spectral_connectivity
    method.accumulate(con_idx, csd)
mne_connectivity/spectral/epochs.py:818: in accumulate
    self._acc[0, con_idx] += im_csd
E   ValueError: non-broadcastable output operand with shape (1,41) doesn't match the broadcast shape (1,2,41)

The multivariate one looks good:

    con_epochs = spectral_connectivity_epochs(
<decorator-gen-623>:12: in spectral_connectivity_epochs
    ???
mne_connectivity/spectral/epochs.py:1891: in spectral_connectivity_epochs
    n_signals, indices_use, warn_times) = _prepare_connectivity(
mne_connectivity/spectral/epochs.py:116: in _prepare_connectivity
    indices_use = check_multivariate_indices(indices)  # pad with -1
mne_connectivity/utils/utils.py:134: in check_multivariate_indices
    raise TypeError(
E   TypeError: multivariate indices must contain array-likes of channel indices for each seed and target

Lmk if there's anything I can clarify

tsbinns commented 10 months ago

So if a tuple of 1D arrays (bivariate indices) is passed when a multivariate method is called, I suppose it would be valid (and simple) to infer this as the indices for a single, multivariate connection. In the test above we could then use (np.array([0, 1]), np.array([2, 3])) as the indices for both the bivariate and multivariate methods.
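A rough sketch of what that inference could look like (purely illustrative, not the final implementation):

import numpy as np

seeds, targets = np.array([0, 1]), np.array([2, 3])
if seeds.ndim == 1 and targets.ndim == 1:
    # treat the pair of 1D arrays as a single multivariate connection
    indices = (seeds[np.newaxis], targets[np.newaxis])
# indices -> (np.array([[0, 1]]), np.array([[2, 3]]))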

If, however, a tuple of 2D arrays (multivariate indices) is passed when a bivariate method is called, I don't think it's valid to infer the indices from this, and an error should be raised. I would change it so the obscure error message is clearer for this scenario.

How does this sound to you @adam2392? If you're happy with the approach I would go ahead and also adapt the unit tests for this.

adam2392 commented 10 months ago

So if a tuple of 1D arrays (bivariate indices) is passed when a multivariate method is called, I suppose it would be valid (and simple) to infer this as the indices for a single, multivariate connection. In the test above we could then use (np.array([0, 1]), np.array([2, 3])) as the indices for both the bivariate and multivariate methods.

Is this because the multivariate methods will never do a full bivariate connectivity? Or could this simply reduce to computing connectivity among each pair of channels?

If, however, a tuple of 2D arrays (multivariate indices) is passed when a bivariate method is called, I don't think it's valid to infer the indices from this, and an error should be raised. I would change it so the obscure error message is clearer for this scenario.

Sure this sounds good pending the discussion of the above point.

tsbinns commented 10 months ago

Is this because the multivariate methods will never do a full bivariate connectivity? Or could this simply reduce to the computing of connectivity among each pair of channels?

The multivariate methods would accept channel pairs, but doing so makes little sense as there's nothing for these methods to optimise. E.g. if you have one seed and one target and want to look at imaginary part of coherency, you should use iCoh. Admittedly MIC and MIM should return the same result as iCoh, but it's complete overkill to use the multivariate methods in this case. The same would apply for Granger causality if there was an equivalent bivariate method implemented. I would also say that the expectation with multivariate methods is that you are dealing with multiple seeds and multiple targets at once.

Of course, treating this bivariate indices format as indices for channel pairs makes for a more consistent interaction with the end-user. I.e. if you give the indices ([0, 1], [2, 3]), you should get the results for connections 0-2 and 1-3 whether you call a bivariate or multivariate method. Perhaps this matters more...

Let me know which one you think works best and I will adapt the behaviour accordingly.

adam2392 commented 10 months ago

The multivariate methods would accept channel pairs, but doing so makes little sense as there's nothing for these methods to optimise. E.g. if you have one seed and one target and want to look at imaginary part of coherency, you should use iCoh. Admittedly MIC and MIM should return the same result as iCoh, but it's complete overkill to use the multivariate methods in this case. The same would apply for Granger causality if there was an equivalent bivariate method implemented. I would also say that the expectation with multivariate methods is that you are dealing with multiple seeds and multiple targets at once.

I see. Okay, I guess as long as this is properly documented in our API, this will make sense.

If you want to add the error message in the bivariate connectivity when multivariate indices are passed, that should be sufficient.

tsbinns commented 10 months ago

I added the following check/error message in check_indices, so the issue of an obscure error should not occur if indices is mistakenly passed as a tuple of arrays of arrays when calling bivariate methods: https://github.com/mne-tools/mne-connectivity/blob/f837889ca0175464e05a230617c713309d89b517/mne_connectivity/utils/utils.py#L79-L81
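A sketch of the kind of check described (the actual code is at the link above; the function name and message here are illustrative):

# hypothetical sketch of a check rejecting nested indices for bivariate methods
def _check_bivariate_indices(indices):
    seeds, targets = indices
    for inds in (seeds, targets):
        # nested entries (arrays of arrays) signal multivariate-style indices
        if any(hasattr(ch, "__len__") for ch in inds):
            raise TypeError(
                "indices for bivariate methods must be a tuple of 1D "
                "array-likes of channel ints; nested (multivariate-style) "
                "indices are not supported")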

This necessitated adding some checks to check_multivariate_indices (I was previously relying on check_indices for this, but that is no longer possible given the above change): https://github.com/mne-tools/mne-connectivity/blob/f837889ca0175464e05a230617c713309d89b517/mne_connectivity/utils/utils.py#L132-L137

I have added checks for this behaviour in the appropriate unit tests: check_indices https://github.com/mne-tools/mne-connectivity/blob/f837889ca0175464e05a230617c713309d89b517/mne_connectivity/tests/test_utils.py#L75-L78

check_multivariate_indices https://github.com/mne-tools/mne-connectivity/blob/f837889ca0175464e05a230617c713309d89b517/mne_connectivity/tests/test_utils.py#L94-L104

Finally, I updated the docstrings of spectral_connectivity_epochs and spectral_connectivity_time to try and make the expected structure of bivariate and multivariate indices even clearer: https://github.com/mne-tools/mne-connectivity/blob/f837889ca0175464e05a230617c713309d89b517/mne_connectivity/spectral/epochs.py#L1570-L1578

tsbinns commented 10 months ago

@drammock Thanks for the feedback! If you think there are things which could still be improved, I'm happy to take the time to make sure this is done right. I'll get in touch via email to organise a meeting. Cheers!

tsbinns commented 8 months ago

@adam2392 @drammock Following our previous discussion there were two main points:

  1. Refactor the multivariate methods into new, multivariate specific functions.
  2. Replace ragged arrays with sparse arrays.

Refactor the multivariate methods into new, multivariate specific functions.

Regarding the refactoring, I have just pushed a suggestion for the spectral_connectivity_epochs function. The existing epochs.py file was getting very hard to follow (> 2,000 lines), so adding more to it seemed problematic to me. Because of this, I tried to identify the features required by both bivariate and multivariate methods and retained these in epochs.py. The remaining code I divided into two files, epochs_bivariate.py and epochs_multivariate.py, which hold the spectral_connectivity_epochs and spectral_connectivity_epochs_multivariate functions and all code specific to the bivariate and multivariate methods, respectively. I updated the unit tests and examples accordingly.

https://github.com/mne-tools/mne-connectivity/blob/238c0de75b600506ae333d7c7667eb028ea8e61b/mne_connectivity/spectral/epochs_bivariate.py#L372-L379

https://github.com/mne-tools/mne-connectivity/blob/238c0de75b600506ae333d7c7667eb028ea8e61b/mne_connectivity/spectral/epochs_multivariate.py#L775-L782

Your feedback on this is very welcome. Based on this I would also make the appropriate changes to spectral_connectivity_time. I hope at least this gives you an idea of what having separate bivariate and multivariate functions looks like.

Replace ragged arrays with sparse arrays.

How to handle ragged arrays is something I am less certain of. I summarise the problem again here:

  1. With all multivariate methods, indices can be ragged, and with max. imag. coherency (MIC), patterns can also be ragged (both have shape of (2, n_cons, n_chans), where n_chans can vary).
  2. Saving connectivity objects uses Xarray. indices and patterns are converted to numpy arrays before saving, but ragged arrays can only be stored with dtype=object. The h5netcdf engine is used to save the object, but this does not support numpy object arrays.

My current workaround was to pad indices and patterns to be 'full' arrays (-1 for indices and np.nan for patterns). There are two separate issues with this approach:

  1. -1 is technically a valid channel index, whereas proper invalid entries such as np.nan and np.inf are floats, which cannot be represented in the integer arrays required for indexing channels.
  2. Padding patterns can potentially bloat memory usage (padding indices could also bloat memory, but to a lesser extent).

There were a few suggestions for how to better handle this, e.g. switching to SciPy sparse arrays or masked arrays; these are discussed in the replies below.

As a final note, I would consider padding indices the more immediate issue, since I anticipate it has a bigger impact on the end-user experience than the reduced memory usage that sparse functionality for patterns would bring (e.g. I never ran into any memory issues in my time using this). I would therefore suggest focusing on handling ragged indices first, but your input is very appreciated.

Apologies it has taken quite some time for the update and that I cannot share anything more concrete, but unfortunately my time to work on this PR has been limited in the last few weeks. I still very much look forward to pushing this PR forward though!

larsoner commented 8 months ago

Sparse

Use SciPy sparse arrays - this would help for the memory bloat of padded arrays, but it does not address the issue of requiring an 'invalid' integer to pad indices with in the first place. I also do not know if sparse arrays can be saved with the h5netcdf engine.

Okay so starting from:

indices can be ragged ... shape of (2, n_cons, n_chans), where n_chans can vary

My thinking was that internally the representation of this would be a scipy.sparse.csr_matrix with shape (2 * n_cons, max(n_chans)). Each row in the csr_matrix would have the first n_chans[ii] columns with defined values. The remaining max(n_chans) - n_chans[ii] values in a given row would not be present in the sparse array, and would therefore function the same way as your "invalid value" would, i.e., "invalid" here is just absence.
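A sketch of that representation, assuming the seed and target channel lists are stacked row-wise (names are illustrative):

import numpy as np
from scipy import sparse

# ragged channel lists: seeds for cons 0-1, then targets for cons 0-1
ragged = [[0, 1], [5, 6, 7, 8], [2, 3, 4], [9]]
n_chans = [len(row) for row in ragged]

data = np.concatenate(ragged)  # the channel indices themselves
col_inds = np.concatenate([np.arange(n) for n in n_chans])
indptr = np.concatenate([[0], np.cumsum(n_chans)])
mat = sparse.csr_matrix((data, col_inds, indptr),
                        shape=(len(ragged), max(n_chans)))

# row ii stores only its first n_chans[ii] entries; the rest are absent
row = 0
chans = mat.data[mat.indptr[row]:mat.indptr[row + 1]]  # -> array([0, 1])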

Saving connectivity objects uses Xarray.

Saving a CSR matrix is easy once you conceptualize a way to save its underlying data -- you can save the data, indices, indptr, and shape (see fifth construction pattern here) -- it's how h5io writes and reads to hdf5 for example.
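i.e., roughly (a sketch of the construction pattern, not h5io's actual code):

import numpy as np
from scipy import sparse

mat = sparse.csr_matrix(np.array([[1, 0, 2], [0, 0, 3]]))

# persist these four pieces in whatever container the file format offers
pieces = dict(data=mat.data, indices=mat.indices,
              indptr=mat.indptr, shape=np.array(mat.shape))

# ...and rebuild the matrix from them on load
restored = sparse.csr_matrix(
    (pieces["data"], pieces["indices"], pieces["indptr"]),
    shape=tuple(pieces["shape"]))
assert (restored != mat).nnz == 0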

So to me the sparse approach still seems tractable.

Current approach

My current workaround was to pad indices and patterns to be 'full' arrays (-1 for indices and np.nan for patterns). There are two separate issues with this approach:

Coming back to this simple existing approach, though -- what are the largest values you'd expect for n_cons and max(n_chans) for the indices? Have we thought about whether the memory usage actually matters in practice? Apologies if so, I haven't looked at existing discussion so if I missed it let me know and I'll look back. Say, for example, we knew the largest in practice would be something like (2, 1000, 1000) (1000 connections, up to 1000 channels) -- storing this many 64-bit integers only takes 16 MB of memory (2 * 1000 * 1000 * 8), so we should live with the -1 padding and move on, I think. Or live with 32-bit ints and cut usage in half.

And even for a sparse representation, if most n_chans are close to 1000 / "full" anyway, the problem gets worse because you store not just the data but also indptr and indices. Rule of thumb memory-wise would be: if more than half the array needs to be represented, you might as well use an "invalid" value or masked array.

And for this:

-1 is technically a valid channel index

It doesn't really matter I think -- we're talking about an internal representation never seen by the user, so as long as we convert all negative indices a user passes into positive ones (a totally reasonable thing to do and document), this approach is okay. And on saving, if you use any compression in the Xarray save step, it should find these -1's and squash them to almost nothing on disk anyway.

Masked array

This will probably have the same underlying memory usage as the dense version (maybe worse because you have to store a bool mask, too). We can go to using it if it helps logistically, though. Fill value of -1 seems fine here just like for the current approach.

drammock commented 8 months ago

Have we thought about whether the memory usage actually matters in practice? Apologies if so, I haven't looked at existing discussion so if I missed it let me know and I'll look back.

to my mind the real advantage of using a sparse array (which IIRC is natively supported in netcdf5 files) is that you can eliminate the internal code that did the padding/unpadding (which for me at least, was confusing when I reviewed it). I agree that the storage overhead is probably acceptably low.

larsoner commented 8 months ago

Ahh... in that case it's possible that a masked array approach would be more readable in the end. And if the fill_value=-1, then the storage of the dense array with -1's at the Xarray save end should be fine. I imagine this would end up being more readable than sparse array mangling in practice.

@tsbinns would it help if I looked at examples/handling_ragged_arrays.py, ran it, and looked to see how this example would change if we used a sparse or masked representation? Knowing that memory/storage is not a big concern, I'm thinking that with a masked array some of those operations would be more natural. Someone needs to test with code to check, though. You're welcome to do it if you want, but in case you'd prefer that someone else invest some coding time rather than just discussing/thinking in the abstract, I can give it a shot, too :)

tsbinns commented 8 months ago

Thanks both for the quick replies!

Coming back to this simple existing approach, though -- what are the largest values you'd expect for n_cons and max(n_chans) for the indices? Have we thought about whether the memory usage actually matters in practice? Apologies if so, I haven't looked at existing discussion so if I missed it let me know and I'll look back. Say, for example, we knew the largest in practice would be something like (2, 1000, 1000) (1000 connections, up to 1000 channels) -- storing this many 64-bit integers only takes 16 MB of memory (2 * 1000 * 1000 * 8), so we should live with the -1 padding and move on, I think. Or live with 32-bit ints and cut usage in half.

And even for a sparse representation, if most n_chans are close to 1000 / "full" anyway, the problem gets worse because you store not just the data but also indptr and indices. Rule of thumb memory-wise would be: if more than half the array needs to be represented, you might as well use an "invalid" value or masked array.

@larsoner I agree that for indices, memory usage will not be an issue when padding, even when dealing with 1,000 connections as you say (which I think is far beyond a normal use case).


And for this:

-1 is technically a valid channel index

It doesn't really matter I think -- we're talking about an internal representation never seen by the user, so as long as we convert all negative indices a user passes into positive ones (a totally reasonable thing to do and document), this approach is okay. And on saving, if you use any compression in the Xarray save step, it should find these -1's and squash them to almost nothing on disk anyway.

to my mind the real advantage of using a sparse array (which IIRC is natively supported in netcdf5 files) is that you can eliminate the internal code that did the padding/unpadding (which for me at least, was confusing when I reviewed it). I agree that the storage overhead is probably acceptably low.

@larsoner @drammock Apologies, I misunderstood this from our previous conversation. The sparse or masked approach makes much more sense to me now.


Example pipeline for masked array support

Ahh... in that case it's possible that a masked array approach would be more readable in the end. And if the fill_value=-1, then the storage of the dense array with -1's at the Xarray save end should be fine. I imagine this would end up being more readable than sparse array mangling in practice.

Just to confirm, if I implemented masked array support, the pipeline would look like the following (a condensed code sketch follows the list):

  1. the end user passes in a ragged array (e.g. a numpy object array): indices = ([[1, 2]], [[3, 4, 5]])
  2. we convert any negative channel indices to the corresponding positive-valued indices
  3. we perform an initial array padding with -1 and convert to a masked array, e.g. indices = ([[1, 2, -1]], [[3, 4, 5]]); indices = ma.masked_values(indices, -1)
  4. whenever we need to index channels, we get only the valid channel indices (avoiding the repeated need to explicitly unpad the indices): seeds = indices[0, con_i].compressed(), i.e. seeds = [1, 2]
  5. when we save the connectivity object, we extract the dense array: save_indices = indices.data
  6. when we load the connectivity object, we convert the dense array back to a masked array: indices = ma.masked_values(indices, -1)
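A condensed sketch of steps 3-6 (variable names are illustrative, not the exact library code):

import numpy as np
import numpy.ma as ma

# step 3: padded indices (one connection, 2 seeds / 3 targets) -> masked array
indices = np.array([[[1, 2, -1]], [[3, 4, 5]]])
indices = ma.masked_values(indices, -1)

# step 4: take only the valid channels for a given connection
con_i = 0
seeds = indices[0, con_i].compressed()    # -> array([1, 2])
targets = indices[1, con_i].compressed()  # -> array([3, 4, 5])

# step 5: extract the dense array (padding included) for saving
save_indices = indices.data

# step 6: re-mask the padded values on load
indices = ma.masked_values(save_indices, -1)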

To me this indeed seems very readable, even as someone with no experience of using this sort of masking.


would it help if I looked at examples/handling_ragged_arrays.py, ran it, and looked to see how this example would change if we used a sparse or masked representation? Knowing that memory/storage is not a big concern, I'm thinking that with a masked array some of those operations would be more natural. Someone needs to test with code to check, though. You're welcome to do it if you want, but in case you'd prefer that someone else invest some coding time rather than just discussing/thinking in the abstract, I can give it a shot, too :)

@larsoner Thank you very much for the offer. If it's masked arrays and the above pipeline is appropriate, I am very happy to add this myself. SciPy sparse arrays, however, are much less clear to me, so I would ask for some help. Perhaps I first try the masked approach and then we assess whether sparse arrays are required instead?

larsoner commented 8 months ago

Perhaps I first try the masked approach and then we assess whether sparse arrays are required instead?

Based on your proposed order/set of operations above, yes, it seems pretty clean and makes sense as a next step to me. @drammock agreed?

drammock commented 8 months ago

Based on your proposed order/set of operations above, yes, it seems pretty clean and makes sense as a next step to me. @drammock agreed?

agreed

tsbinns commented 8 months ago

Sorry for the delay, but I've now added support for indices as masked arrays.

From the steps I outlined earlier:

  1. the end user passes in a ragged array (e.g. a numpy object array): indices = ([[1, 2]], [[3, 4, 5]])

  2. we convert any negative channel indices to the corresponding positive-valued indices https://github.com/mne-tools/mne-connectivity/blob/921cf9db82fc617c0702fac3f56529ef54588320/mne_connectivity/utils/utils.py#L167-L177

  3. we perform an initial array padding with -1 and convert to a masked array, e.g. indices = ([[1, 2, -1]], [[3, 4, 5]]); indices = ma.masked_values(indices, -1) https://github.com/mne-tools/mne-connectivity/blob/921cf9db82fc617c0702fac3f56529ef54588320/mne_connectivity/utils/utils.py#L179-L190

  4. whenever we need to index channels, we get only the valid channel indices (avoids repeated need to explicitly unpad the indices) seeds = indices[0, con_i].compressed(), i.e. seeds = [1, 2], e.g. https://github.com/mne-tools/mne-connectivity/blob/921cf9db82fc617c0702fac3f56529ef54588320/mne_connectivity/spectral/epochs_multivariate.py#L69-L70

  5. when we save the connectivity object, we extract the dense array: save_indices = indices.data. Turns out this is unnecessary, as the masked array gets converted to a standard array anyway.

  6. when we load the connectivity object, we convert the dense array back to a masked array: indices = ma.masked_values(indices, -1) https://github.com/mne-tools/mne-connectivity/blob/921cf9db82fc617c0702fac3f56529ef54588320/mne_connectivity/io.py#L56-L60


Overall I would say this approach works nicely, and you can see how much simpler the built-in methods of masked arrays make handling the indices in the ragged arrays example.

tsbinns commented 8 months ago

I would like to mention the point @drammock raised earlier regarding the helper function check_multivariate_indices:

Does it make sense for this to be public? Normally I think of check functions as "stuff we do internally to make sure the user input conforms to API expectations".

check_multivariate_indices handles the padding and masking, and to keep the codebase clean, I didn't want a public-facing function that just does some checks alongside a private function that performs the checks and also does the masking.

seed_target_multivariate_indices already serves the function of checking seed/target values and putting them into a format accepted by the connectivity functions (a numpy object dtype array). Also, since check_multivariate_indices now returns masked arrays, I'm worried this could confuse users into thinking, e.g., that they should be passing masked arrays into the connectivity functions.

I made it public initially because the bivariate equivalent check_indices was public, but now I am inclined to keep this as an internal check, as @drammock suggested. Thoughts?

tsbinns commented 8 months ago

Also I'm stumped as to why the CI checks keep failing for linux py3.9 specifically. There is a codespell error for references.bib which I tried to fix by adding to ignore_words.txt, but that hasn't worked.

Any help is appreciated!

drammock commented 8 months ago

Also I'm stumped as to why the CI checks keep failing for linux py3.9 specifically. There is a codespell error for references.bib which I tried to fix by adding to ignore_words.txt, but that hasn't worked.

Any help is appreciated!

Try putting Manuel in all lowercase in ignore_words.txt, cf. https://github.com/codespell-project/codespell/issues/2137 (TL;DR: entries in ignore_words.txt must match the way the entry is written in codespell's internal dictionary, not how they appear in the code files)

tsbinns commented 8 months ago

Thanks @drammock, that fixed the codespell error!

Unfortunately it seems there is a separate issue again for Python 3.9 with conda; I will look properly tomorrow.

There is also an error building the docs in cwt_spectral_connectivity, related to MNE-Python's plot_topo function: https://app.circleci.com/pipelines/github/mne-tools/mne-connectivity/765/workflows/2913fba2-9d82-4753-8f42-83b10e52a358/jobs/829

larsoner commented 8 months ago

Fixed CIs separately in #154 and merged main into your branch; hopefully that gets everything green 🤞

tsbinns commented 8 months ago

Thanks @larsoner, all green!


Then for me the outstanding questions would be:

  1. Should check_multivariate_indices be public or private? https://github.com/mne-tools/mne-connectivity/pull/142#issuecomment-1781912510

  2. Is my refactoring approach for spectral_connectivity_epochs appropriate? I would then apply the same for spectral_connectivity_time. https://github.com/mne-tools/mne-connectivity/pull/142#issuecomment-1777651595

larsoner commented 8 months ago

Should check_multivariate_indices be public or private?

My vote is to start with it private and make it public later if needed, but hopefully YAGNI. It's much easier (trivial-ish) to make a private thing public. Making a public thing private is hard and annoying for end users. If it's not needed in the example(s) that you added (or can be removed from those without sacrificing much) then that also suggests it doesn't need to be public.

Is my refactoring approach for spectral_connectivity_epochs appropriate? I would then apply the same for spectral_connectivity_time.

It looks like you're talking about taking a multi-thousand-line file and splitting it into multiple files. In general yes this helps maintainability, even though it will kill git blame a bit (and can't be worked around with .git-blame-ignore-revs as far as I know). But doing it in this PR to add multivariate/ragged support makes reviewing difficult 😞 So I see two options.

1. Open a refactoring-only PR first

  1. First open a new PR that only moves code from the single monolithic file to the split file(s), plus whatever import statements are needed to glue everything together. This PR will be trivial to review because it should be a no-op from the code/API/functionality standpoint, so we can (hopefully!) trust the tests and flake CIs to tell us if something is wrong knowing all you did was copy-paste and then adapt import statements. Once that's merged...
  2. Rebase this PR or merge main into this branch 😱

It might be a bit onerous from a git standpoint but it will make reviewing this multivariate stuff easier, because we will see only what's necessary to add that functionality.

If you end up needing help with step (2) above, I'm happy to volunteer to help, as I've had to get through some tough rebases and merges before. FWIW I've already "backed up" your current state locally and on my fork:

git checkout -b pr-mvc_padding tsbinns/pr-mvc_padding
git push origin pr-mvc_padding

2. Revert changes and split files later

Another approach would be to:

  1. Undo the splitting in this PR by copy-pasting stuff back where it was. In the end we squash+merge so as long as the diff on GitHub doesn't show huge code block moves then it'll be okay. Then this PR is once again reviewable as all changes are relevant for the ragged/multivariate stuff. Then once this PR is merged...
  2. Open a new PR to split into multiple files.

drammock commented 8 months ago

Open a refactoring-only PR first [... or] Revert changes and split files later

The refactoring was my suggestion at our last meeting (motivated by the difference in the function signatures for multivariate vs univariate connectivity measures). Just commenting to say that since it was my request, I'm also experienced with and happy to help with PR splitting / rebasing / etc. Just say the word.

tsbinns commented 8 months ago

Thanks both for the comments & offers to help! I have to admit I'm not so experienced with PR splitting/rebasing, so for me option 2 (reverting to a pre-refactoring state and re-implementing the masked arrays there) sounds simpler. I'm not sure how exactly I would approach option 1...

My thought would also be that if we are happy with the changes related to support for ragged connections/masked arrays, then perhaps we could focus on merging this first (would also address some issues in the current dev branch, e.g. #153). Afterwards, we can focus on the refactoring, which I imagine would require a lot more time and discussion to get right. What do you think?

tsbinns commented 8 months ago

Sorry that the commit history is such a mess, but I have reverted all the refactoring changes. Everything now relates only to the support for multiple connections with the multivariate methods, with the use of masked arrays for handling ragged indices.

I have also made check_multivariate_indices private following the above discussion. https://github.com/mne-tools/mne-connectivity/blob/d6ba398de507204c681a4a1c3de255d05020ac44/mne_connectivity/utils/utils.py#L86-L87

I believe all points relating to how multivariate indices are handled have now been addressed.

drammock commented 8 months ago

In it goes! Thanks a million @tsbinns !!

adam2392 commented 8 months ago

Amazing work @tsbinns !