Closed: tsbinns closed this 10 months ago
tagging @larsoner @agramfort @adam2392. This may need significant discussion to hammer out the details.
Hi @tsbinns Thanks for making the PR and offering such a substantial contribution!
In the meantime, though, for my own education and to guide the discussion, here are a few things we'll need to discuss at a high level between your team and ours. Note these are just questions to guide discussion. Excited to move this forward! Feel free to chime in on each topic here:
Why a new `MultivariateSpectralConnectivity` class? We generally followed a design where we kept the connectivity classes light-weight and simply used them as an API for the underlying data, while the bulk of the computation is done inside functions. Of course, if there's something I'm missing that doesn't fit into the existing connectivity objects nicely, then we can expand the scope of the connectivity class objects.

Hi @adam2392, thank you for the reply! Here are my thoughts on your questions:
1. Testing for "correctness" of each method
So if you are talking along the lines of defining some thresholds and checking whether the estimated connectivity values fall within this range for some simulated data (like you have in `test_spectral_connectivity()` of `test_spectral.py`), then we have implemented something similar in `test_methods_and_modes()` of `test_epochs_multivariate.py` (although we wanted to have some internal discussion before deciding on suitable thresholds for the Granger scores).
If you mean something a bit more advanced such as checking that the connectivity estimates match some simulated data with a known connectivity profile, then I do not know of anything off the top of my head. However, I am sure this is something we could figure out.
So far, we have been making sure to compare our connectivity estimates with those produced from the original MATLAB implementations of these methods as a sanity check. I realise this is not a test of "correctness" as such, but at least it confirms that these implementations match the authors' intended method. Unfortunately this sanity check is not possible in all circumstances due to the addition of new features (for example, we have added options for performing dimensionality reduction prior to estimating connectivity to reduce overfitting and enable working with non-full rank data, as well as the ability to extract spatial topographies for connectivity). However, we are confident that these additional features are working as intended.
2. Storage of the connectivity A very good point. It would be desirable to store some additional information specific to the multivariate connectivity methods (e.g. the number of lags used when computing Granger causality, the rank the data was reduced to before estimating connectivity), but of course, this would not interfere with any existing behaviour.
Something a little less trivial is the storage of any spatial topographies, as these can end up being ragged object arrays (they have dimensions [connections x channels x freqs (x times)], and so depending on the number of channels used in each connection, can end up ragged). For instance, when saving with HDF5, we handled this by adding functionality to pad the object arrays such that they can be stored as regular numpy arrays, and then unpadded when being loaded. Again though, this would not interfere with any existing behaviour, just add something new.
However, whilst it is currently the case that the shape of the connectivity results is identical to that in the existing classes ([connections x freqs (x times)]), this may change. For maximised imaginary coherence, the current implementation involves taking only a single (largest) connectivity component. However, it can also be of interest to inspect other components of the connectivity (as many components as the rank of the data are possible, similar to the situation with the results of `mne.decoding.ssd`). This would then give results with dimensions [connections x rank x freqs (x times)]. Whilst this change has not yet been implemented, it is one we would like to make, and it could make storing the results in the existing classes without affecting their fundamental behaviour challenging.
3. Maintenance longer-term Absolutely, we would be happy to maintain this code if it were added.
4. Interpretations Definitely. I agree it would be very helpful to have some examples, particularly for interpreting any spatial topographies and for the various possible ways of handling the Granger scores. We would be happy to write these up following the outline of the existing examples you have written.
Including a short paragraph or two for each of the connectivity methods in the documentation is also something we could easily add, and there are plenty of publications we could cite for those interested in further reading.
I would be interested to hear your thoughts on this. Please just let me know if you have any further questions or would like to discuss any particulars in more detail!
Hi @tsbinns, thanks for your timely response! We have a dev meeting on Friday, so I will get back to you after that. In the meantime, here are some things to think about:
1. Testing for "correctness" of each method ... If you mean something a bit more advanced such as checking that the connectivity estimates match some simulated data with a known connectivity profile, then I do not know of anything off the top of my head. However, I am sure this is something we could figure out.
Yes that would be great! The existing tests you have regarding checking the correct range of values is also great.
So far, we have been making sure to compare our connectivity estimates with those produced from the original MATLAB implementations of these methods as a sanity check.
Something to think about:
Is it possible to get a few hard-coded short examples run through MATLAB that we can just compare against? E.g. some simulated data generated that we can load in mne-connectivity during `pytest`, which has also had its results run through MATLAB v1.x*. If it's small enough, then it can easily be included in the repo itself. We can then explicitly run the simulated data through mne-connectivity and compare with the hard-coded results. This will help us control for regressions as we refactor the code.
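One way to structure such a regression check is sketched below. The values and the loader are hypothetical stand-ins: in a real test, both the simulated data and the MATLAB reference results would be loaded from small files stored in the repo.

```python
import numpy as np

# Hypothetical reference values exported from the MATLAB implementation
# on a fixed simulated dataset (placeholder numbers for illustration).
matlab_reference = np.array([0.12, 0.34, 0.29])

def load_simulated_results():
    # Stand-in for running the stored simulation data through
    # mne-connectivity; returns fixed values purely to show the pattern.
    return np.array([0.12, 0.34, 0.29])

def test_against_matlab_reference():
    # Any refactor that changes the numerics beyond tolerance trips this.
    np.testing.assert_allclose(load_simulated_results(), matlab_reference,
                               rtol=1e-7)

test_against_matlab_reference()
```

Keeping the reference values hard-coded (or in a tiny `.npy` file) makes the comparison independent of MATLAB being available at test time.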
2. Storage of the connectivity A very good point. It would be desirable to store some additional information specific to the multivariate connectivity methods (e.g. the number of lags used when computing Granger causality, the rank the data was reduced to before estimating connectivity), but of course, this would not interfere with any existing behaviour.
Here, we can leverage the existing `*TemporalConnectivity` containers, which have an axis dedicated to "time". We can add lags here, and metadata corresponding to the axis to make sure this is apparent. IIRC, I think xarray has some flexibility regarding this.
Something a little less trivial is the storage of any spatial topographies, as these can end up being ragged object arrays (they have dimensions [connections x channels x freqs (x times)], and so depending on the number of channels used in each connection, can end up ragged). For instance, when saving with HDF5, we handled this by adding functionality to pad the object arrays such that they can be stored as regular numpy arrays, and then unpadded when being loaded. Again though, this would not interfere with any existing behaviour, just add something new.
So it seems the main potential differences are the chance of a ragged array, and the possibility of an extra "arbitrary" dimension (e.g. rank). I don't think we want to support arbitrary lists of connectivities in a container. It sort of breaks the xarray design we have.
Re ragged array: Do you mind elaborating on the exact scenario where we might get a ragged connectivity array? I don't think I follow. If we have a connectivity over spatial topographies where each spatial point value is computed using, say, 2-5 channels, the end connectivity should still be over the spatial points, right: `(n_spatial_points, n_spatial_points)`?
In terms of storing ragged points: Perhaps we can encode the missing "connectivity" parts with `np.nan`? Or zeros? I guess zeros are not ideal, cuz a real value could be 0. Then, someone can just remove the `np.nan` components. We currently don't have a sparse implementation, but we might think about supporting that in the future once numpy/scipy/sklearn move to generic sparse NDArrays (https://github.com/pydata/sparse).
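A minimal sketch of that NaN-padding idea, with made-up per-connection values and lengths:

```python
import numpy as np

# Ragged per-connection results: 3 values for the first connection,
# 2 for the second (hypothetical numbers for illustration).
ragged = [np.array([0.2, 0.5, 0.1]), np.array([0.4, 0.3])]
max_len = max(len(r) for r in ragged)

# Pad with np.nan (not 0, since 0 is a valid connectivity score) so the
# results fit into one regular [connections x max_len] array.
padded = np.full((len(ragged), max_len), np.nan)
for i, r in enumerate(ragged):
    padded[i, :len(r)] = r

# A user can simply drop the NaN components to recover the ragged data.
recovered = [row[~np.isnan(row)] for row in padded]
```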
Re arbitrary extra dimension(s): Thinking out loud here... I think this could just be a generic "ExtendedConnectivity" container that allows arbitrary extension of the dimensions stored. E.g. the biggest one currently is `EpochedSpectroTemporalConnectivity`, which has unraveled shape `(n_epochs, n_nodes, n_nodes, n_freqs, n_times)`, so we would want some way of supporting `(n_epochs, n_nodes, n_nodes, n_freqs, n_times, ...)`, where `...` is extra data dimensions defined by the user. Note this might require refactoring of the existing Connectivity classes; I think there are some implicit assumptions currently on the positioning of the `time` and `epochs` axes. Something like:
```python
class ExtendedConnectivity(BaseConn):
    def __init__(self, data, ..., **kwargs):
        ...

conn = ExtendedConnectivity(data, ..., rank=[1, 2, 3, 4], subjects=[0, 1, 5, 10])
print(conn.dims)
# > (n_epochs, n_nodes, n_nodes, n_freqs, n_times, rank, subjects)
```
Hi @adam2392,
Is it possible to get a few hard-coded short examples run through MATLAB that we can just compare against? E.g. some simulated data generated that we can load in mne-connectivity during pytest, which has also had its results run through MATLAB v1.x*. If it's small enough, then it can easily be included in the repo itself. We can then explicitly run the simulated data through mne-connectivity and compare with the hard-coded results. This will help us control for regressions as we refactor the code.
It's definitely possible. One thing to consider is that I am having to compute the CSD in MNE and then load this into MATLAB, since we did not have the MATLAB code to replicate MNE's CSD computation procedure. This would mean no changes can be made to the way the data is simulated or the CSD is computed in MNE, otherwise someone would have to run the MATLAB computations again. Still, as long as this is kept in mind, it would not be a major problem.
Re ragged array: Do you mind elaborating what is the exact scenario we might get a ragged connectivity array? I don't think I follow. If we have a connectivity over spatial topographies where each spatial point value is computed using say 2-5 channels, the end connectivity should still be over the spatial points though right (n_spatial_points, n_spatial_points)?
So the spatial topographies have dimensions [connections x channels x freqs (x times)]. If the seeds/targets for all connections have the same number of channels, then the array will not be ragged. However, say we have 2 connections, the first involving 3 seed channels and the second involving 5 seed channels. The spatial topographies for our first seed have shape [3 x freqs (x times)], and those for our second seed [5 x freqs (x times)]. Combining these topographies across connections then results in a ragged array.
The same situation applies if we switched to returning more than just the largest connectivity component for the maximised imaginary part of coherency, as the connectivity results would also have dimensions [connections x channels x freqs (x times)], and so depending on the number of channels used for the seeds and targets of a given connection, the shape of the array for each connection could vary.
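To illustrate with hypothetical numbers (3 vs. 5 seed channels, 10 frequencies), combining such topographies forces a numpy object array:

```python
import numpy as np

# Topographies for two connections with differing channel counts
# (3 and 5 channels; 10 freqs), per [channels x freqs].
n_freqs = 10
topo_conn1 = np.random.rand(3, n_freqs)
topo_conn2 = np.random.rand(5, n_freqs)

# Mismatched channel dimensions mean the per-connection arrays cannot
# be stacked into a regular array; only an object (ragged) array works.
topographies = np.empty(2, dtype=object)
topographies[0] = topo_conn1
topographies[1] = topo_conn2
```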
I also forgot to mention that the same issue applies to `indices`. In the existing bivariate case, the indices of seeds/targets are given as a `list[int]`. Multivariate indices can instead be specified as a `list[list[int]]`, where each nested list contains the channel indices for a given seed/target. Hence, the dimensions of `indices` for seeds/targets are [connections x channels], and a varying number of channels across connections would make `indices` ragged (although since `indices` is not stored as an array until it is saved, this only becomes an issue there).
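A minimal illustration of the two formats (the channel numbers are hypothetical):

```python
# Bivariate indices: one channel per connection entry (list[int]).
bivariate_indices = ([0, 1], [2, 3])  # (seeds, targets)

# Multivariate indices: each connection gets a list of channels
# (list[list[int]]), so per-connection lengths can differ (ragged).
multivariate_indices = (
    [[0, 1, 2], [3, 4, 5, 6, 7]],  # seeds: 3 channels, then 5
    [[8, 9], [10, 11]],            # targets: 2 channels each
)
```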
In terms of storing ragged points: Perhaps we can encode the missing "connectivity" parts with np.nan? Or zeros? I guess zeros are not ideal cuz perhaps they could be 0? Then, someone can just remove the np.nan components. We currently don't have a sparse implementation, but might think about supporting that in the future once numpy/scipy/sklearn move to generic sparse NDArrays https://github.com/pydata/sparse.
Our approach so far has been to store the seed and target topographies as ragged arrays under the `topographies` attribute of the new connectivity classes. We only convert the topographies to a homogeneous-dimension state when this is required for saving. In this case, we do the following:

1. The seed/target `indices` of each connection are padded with `np.inf` (a value which should not occur in `indices`) until the seed/target indices of each connection have the length of the largest seed/target, e.g. padding the 3 indices of the first seed with two `np.inf` entries so that both seeds have length 5. The indices for seeds/targets then have the homogeneous dimensions [2 x 5].
2. Likewise, channels of `np.inf` are added to the topographies depending on the number of "missing" channels for a given connection. Following the example, the topographies for the first seed would be padded with 2 channels (before: [3 x freqs (x times)] => after: [5 x freqs (x times)]). Doing so for all seeds/targets gives arrays with the homogeneous dimensions [2 x 5 x freqs (x times)]. This padding ensures `indices` and `topographies` can be stored as regular numpy arrays, making them compatible with HDF5.
3. When loading, `indices` are checked for padded values (`np.inf`). Counting the number of `np.inf`s per connection tells us how many channels we can discard from the `topographies` for the seeds/targets of each connection. Discarding the padded values from `indices` and `topographies`, we then return to our ragged array state.

Admittedly, handling ragged arrays is quite cumbersome. If you believe padding `topographies` by default (e.g. with `np.nan`, as 0 is indeed a valid connectivity score) would be a better approach, then certainly we could adapt the approach we take when saving to be the default way of handling spatial topographies (likewise for the connectivity results, if the abovementioned changes were made).
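The pad-on-save / unpad-on-load steps above can be sketched roughly as follows (the index values are hypothetical, and the real scheme also pads the topographies in the same way):

```python
import numpy as np

# Ragged seed indices for 2 connections: 3 channels, then 5.
seeds = [[0, 1, 2], [3, 4, 5, 6, 7]]
max_n = max(len(s) for s in seeds)

# Save: pad with np.inf (which cannot be a real channel index) to get a
# regular [2 x 5] array that HDF5 can store.
padded = np.full((len(seeds), max_n), np.inf)
for i, s in enumerate(seeds):
    padded[i, :len(s)] = s

# Load: count and discard the np.inf entries per connection to return
# to the ragged state.
unpadded = [row[row != np.inf].astype(int).tolist() for row in padded]
```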
Re arbitrary extra dimension(s): Thinking out loud here... I think this could just be a generic "ExtendedConnectivity" container that allows arbitrary extension of the dimensions stored.
Yes, that would be a nice, flexible solution. If this does end up requiring some refactoring of the existing connectivity classes, I would be happy to help with a related PR :)
@tsbinns Sorry, the dev meeting is actually next Friday. However, I am going to be out at that time... I may make it to the meeting, but if not, the next one is in 3 weeks. I think the ragged and extended dimensions require a bit more discussion, so my apologies for the possible delay on that.
I think overall, I am optimistic we can figure out a way to make this work. It may not include every feature you want, but perhaps we can provide some abstractions that make it possible for you to implement the rest on your end in code that depends on mne-connectivity. E.g. if you want ragged arrays, maybe we just end up supporting default padding.
However, I think regarding the rest of the code, it might be possible to begin moving forward w/ a PR if you think so? Out of the list here https://github.com/mne-tools/mne-connectivity/blob/024bc7dc54092440446e54eaa23179770c43b786/mne_connectivity/spectral/epochs_multivariate.py#L1310-L1328, are there any that can be implemented without the need for a ragged list of seeds and targets and without the need to extend the dimensionality of the connectivity? Even just having a clean PR of just one of the methods can serve as a baseline and playground for us to work out the kinks of supporting all of them. Ideally, they can also leverage one of the existing connectivity classes (for now). Lmk wdyt. I'm also happy to jump on a call about this specific PR if you have questions.
@adam2392 Regarding the dev meeting, no worries!
As for a baseline PR, yes, I think that would be possible. Only maximised imaginary coherence (MIC) would require an extension of the connectivity results dimensions (and it is also the only method involving potentially ragged topographies). For all other methods, the dimensionality of the results is not changed, and so they could rely on the existing connectivity classes. The multivariate interaction measure (MIM) is the simplest in terms of the amount of code required, so that could be a good starting point.
However, all of these methods would require support for ragged indices. If this were implemented using the existing connectivity classes, everything except for saving the class should work.
Having a call to iron out some details for this example PR would be useful for me; thanks for offering! My schedule is generally quite flexible, so please let me know a date that would work for you. Also, I am based in Europe, so would you have time to meet in the morning? (otherwise it might be a bit late for me)
However, all of these methods would require support for ragged indices. If this were implemented using the existing connectivity classes, everything except for saving the class should work.
Would it make sense to implement them w/o the assumption that `indices` can be a list of lists for now, and then extend them all simultaneously later? I.e. does the metric still make sense w/o a ragged indices list (e.g. if we compute MIM across all EEG channels)?
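For instance, in the non-ragged case the multivariate indices could simply be regular 2D arrays (the channel numbers here are hypothetical):

```python
import numpy as np

# Non-ragged multivariate indices: every connection has the same number
# of seed and target channels, so regular arrays suffice and no padding
# or object arrays are needed.
seeds = np.array([[0, 1, 2], [3, 4, 5]])
targets = np.array([[6, 7, 8], [9, 10, 11]])
indices = (seeds, targets)  # each with shape [connections x channels]
```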
Having a call to iron out some details for this example PR would be useful for me; thanks for offering! My schedule is generally quite flexible, so please let me know a date that would work for you. Also, I am based in Europe, so would you have time to meet in the morning? (otherwise it might be a bit late for me)
Sure, can you send me an email at adam.li at Columbia dot edu and we can sort out a time?
Would it make sense to implement them w/o the assumption that indices can be a list of lists for now and then extend them all simultaneously later? I.e. does the metric still make sense w/o ragged indices list (e.g. if we compute MIM across all EEG channels)
Actually yes, this is a valid use of MIM, so I could get started with a PR using the existing indices format.
Sure, can you send me an email at adam.li at Columbia dot edu and we can sort out a time?
Yes, I'll write you soon!
@adam2392 @tsbinns did you two work out the details here? If not, do you still need feedback on all four points from https://github.com/mne-tools/mne-connectivity/pull/125#issuecomment-1426302696 ?
@adam2392 @tsbinns did you two work out the details here? If not, do you still need feedback on all four points from https://github.com/mne-tools/mne-connectivity/pull/125#issuecomment-1426302696 ?
Currently @tsbinns is working on a PR to add the new connectivity metrics without ragged indices and without a new dimension in connectivity.
Any feedback on the ragged indices and new dimensions is appreciated!
Hi @larsoner, as @adam2392 mentioned, I am currently working on a PR which could be used as a base before the full functionality of these methods is extended to.
I have the implementation working and a fairly thorough example written, however I still had a few unit tests I wanted to add. If you would want me to already push this PR so that you can get an idea of things already, I am happy to do so, otherwise I would try to finish the unit tests first and then submit something in the next few days. Let me know what works best for you.
Cheers!
Feel free to push once you have the unit tests and the example, then we can do a proper review. Then once that's merged (since it's uncontroversial), how about you open a follow-up PR with the ragged indices and extra dimensions as a proposal-by-PR? Assuming that splitting things up isn't excessive work at your end, it'll make reviewing and thinking about the other stuff easier.
Hi @tsbinns just checking in to see how things are going? Feel free to open the PR for early feedback if there's some issues you wanna chat thru.
Hi @adam2392 , sorry for the delay. I got caught up with some other work, then I lost a week being off sick with the flu, and now I'm on holiday until next week :/ I have not forgotten about this though! Perhaps I can write you a detailed update when I get back. Cheers!
We have merged in #138 and now it would be good to perhaps summarize what remains to be done (if any).
I will try to make a release sometime in the next few weeks when I have time.
We have merged in #138 and now it would be good to perhaps summarize what remains to be done (if any).
I will try to make a release sometime in the next few weeks when I have time.
Thanks very much @adam2392!
The only thing which remains is adding support for connectivity computations where the number of seeds and targets differ within and across connections. I suppose the two options are: 1) add support for ragged arrays; or 2) we pad the arrays with some known value (e.g. NaN) to ensure they are not ragged. I would be very happy to discuss this further.
With this change we would have the full functionality of these multivariate methods!
The only thing which remains is adding support for connectivity computations where the number of seeds and targets differ within and across connections. I suppose the two options are: 1) add support for ragged arrays; or 2) we pad the arrays with some known value (e.g. NaN) to ensure they are not ragged. I would be very happy to discuss this further.
With option 1), I would say if there's a way to add the optional dependency https://github.com/scikit-hep/awkward and easily transform between awk, numpy and xarray, then we could do that.
Otw, option 2) seems like a good option w/ documentation and a short example of how to do this.
With option 1), I would say if there's a way to add the optional dependency https://github.com/scikit-hep/awkward and easily transform between awk, numpy and xarray, then we could do that.
Otw, option 2) seems like a good option w/ documentation and a short example of how to do this.
I am not familiar with awkward, but from a first glance it looks like this could work. In any case, I imagine avoiding additional dependencies would be preferable?
Having played around with the two approaches, I personally lean towards padding the arrays from an end-user perspective. Handling the arrays is much simpler, and it is not at all difficult to extract the relevant information. With documentation and an example like you mention, I do not see it being a problem for users. It would also be trivial to implement.
If you like, I could prepare a draft PR with an implementation of option 2 to get a better idea of how it would practically work. Depending on this we could decide to stick with it or switch to option 1. How does that sound to you?
Just a note to say that I know a couple of the awkward devs/maintainers. My take is that it would be okay to add this as a dependency (i.e., I think it's likely to be around and supported for a while) if option 2 (padding regular arrays with `np.inf`) proves to be troublesome or slow.
Just logging my notes from last call:
cc: @larsoner, @drammock and @tsbinns if I missed anything
@adam2392 @drammock @larsoner Unless there are any outstanding issues people still feel need to be addressed, I would propose to close this PR now that the methods have been implemented in #138 and #142, and the refactoring addressed in #156.
Myself and some colleagues have been working on incorporating some multivariate connectivity methods into the MNE-connectivity package. The methods are two forms of imaginary coherence (maximised imaginary coherence, and the multivariate interaction measure; Ewald et al., 2012) and a form of multivariate Granger causality based on state-space models (Barnett & Seth, 2014, 2015).
These implementations in MNE are based on the original implementations of the methods in MATLAB (multivariate coherence code provided by the authors of the method, multivariate Granger causality code partially available here), with some additional features added such as options for dimensionality reduction to handle overfitting/non-full rank data, as well as the recovery of spatial topographies accompanying connectivity.
Whilst these implementations work in MNE and have an accompanying (albeit incomplete) suite of unit tests (`test_epochs_multivariate.py`, additions to `test_utils.py`), a new utility for creating multivariate indices, as well as new classes for storing the multivariate connectivity measures (useful for e.g. allowing the storage of spatial topographies), we are not entirely satisfied with the 'cleanliness' of the code (there are also a number of code style discrepancies and instances of incomplete documentation). Given the different considerations for handling multivariate connectivity in comparison to the existing bivariate connectivity measures, a lot of new code has needed to be added, and we have found it quite difficult to refactor the existing code used for the other connectivity measures (e.g. computation of cross-spectral densities).

Some internal changes we imagined included moving the classes where connectivity is computed from `epochs.py` to a new file `epochs_classes.py`, as well as creating a new file for the functions where the connectivity methods are called (`epochs_multivariate.py`, with a new function for calling these methods, `multivariate_spectral_connectivity_epochs()`). Without this, `epochs.py` was simply becoming too large a file with the added code for the multivariate connectivity measures. Of course, we are happy to hear your thoughts on this matter.

We have been working on these implementations for some time now, and we are at a stage where we are confident the methods are working correctly (certainly from a mathematical standpoint, and more or less also from the perspective of an end user); it is now just a case of finding the correct internal implementation style, for which we would appreciate your insight. I noticed an issue was posted about the possibility of adding other multivariate methods (Issue #102); perhaps there are some common considerations we could think about to make any implementation of multivariate methods as flexible/expandable as possible.
We look forward to hearing your thoughts!