Open alexrockhill opened 2 years ago
Support for a frequency dimension within STCs is already on the roadmap: https://mne.tools/dev/overview/roadmap.html#time-frequency-classes and work is underway on that goal in #10184 (so far only for sensor-space containers, though)
...but that doesn't cover adding a dimension for trials.
Thanks for connecting that document and PR @drammock! What do you think about the idea that it might be assembled more modularly than a Spectrum class, for instance? The analogy is that if the data were too large, you might need to compute the PSD per channel and then have the user assemble them all into a spectrum class via whatever method best fits their RAM. Obviously, that's not the case for PSDs, so the API might be different. Curious to hear what you think the best solution to this is.
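To make the "assemble it yourself" idea concrete, here is a minimal sketch (not an existing MNE API) that computes the PSD one channel at a time and stacks the results afterwards; `raw` is assumed to be an already-loaded `mne.io.Raw`, and the helper name is hypothetical:

```python
import numpy as np
from mne.time_frequency import psd_array_welch

def psd_per_channel(raw, fmin=1.0, fmax=80.0, n_fft=2048):
    """Hypothetical helper: bound RAM by computing one channel's PSD at a time."""
    psds = []
    for name in raw.ch_names:
        data = raw.get_data(picks=[name])  # shape (1, n_times)
        psd, freqs = psd_array_welch(data, sfreq=raw.info['sfreq'],
                                     fmin=fmin, fmax=fmax, n_fft=n_fft)
        psds.append(psd[0])
    # the user would then wrap this array in whatever spectrum container fits
    return np.vstack(psds), freqs  # (n_channels, n_freqs)
```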
> if the data were too large, you might need to compute the PSD per channel and then have the user assemble them all into a spectrum class via whatever method best fits their RAM
I guess my gut reaction is that if the data is too large to generate the spectrum/spectrogram in-memory, then users should consider either (1) a lower-density source space, or (2) fewer timepoints/frequency bins (AKA downsample / decimate). If neither of those are possible, maybe the pipeline shouldn't be running on a laptop? I know it sounds a bit harsh, but some datasets/analyses are big enough to start to look like "big data" and I'm uncertain how far we should go to accommodate running such jobs on under-resourced hardware. The more magic we do under the hood, the harder that code becomes to maintain.
(side note: FFT and an MNE or dSPM inverse are all linear operations, so you can do FFT on sensors before applying inverse and get the same answer as doing inverse first and then FFT, but FFT first is much more efficient because in typical cases you're doing FFT on ~300 channels instead of ~20k vertices.)
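A quick numpy sketch of that commutativity claim (random stand-in matrices, not real MNE objects): the FFT acts along the time axis and a linear inverse kernel acts along the channel axis, so the order doesn't change the result, only the number of FFTs you have to compute.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vertices, n_channels, n_times = 50, 10, 256
K = rng.standard_normal((n_vertices, n_channels))  # stand-in for a linear inverse kernel
Y = rng.standard_normal((n_channels, n_times))     # stand-in for sensor data

inverse_then_fft = np.fft.rfft(K @ Y, axis=-1)     # n_vertices FFTs
fft_then_inverse = K @ np.fft.rfft(Y, axis=-1)     # only n_channels FFTs
assert np.allclose(inverse_then_fft, fft_then_inverse)
```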
As far as across trials, we have some support for a `return_generator` kwarg (e.g., STCs from Epochs); my instinct would be to keep supporting something like that.
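For reference, the existing generator pattern looks roughly like this (`epochs` and `inv` are assumed to already exist); each per-trial STC is produced lazily, so only one needs to be in memory at a time:

```python
from mne.minimum_norm import apply_inverse_epochs

stcs = apply_inverse_epochs(epochs, inv, lambda2=1.0 / 9.0, method='dSPM',
                            return_generator=True)
for stc in stcs:  # each SourceEstimate is computed on demand
    peak_vertex, peak_time = stc.get_peak()
```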
I disagree; you could have use cases where you want time-frequency decomposed data over 72 hours of continuous recording sampled at 1 kHz for 100+ channels, and it just isn't practically feasible yet to hold that kind of data in RAM even with a large hardware budget. Practically, I've found when working with large datasets that it's much more efficient to loop over something like channels, where each one is independent or the dependence can be handled in one step, such as solving the leadfield.
As a very practical example, say you have a typical 2 GB file: a TFR with 100 frequencies at trial resolution is 200 GB of RAM, which is just impractical compared to looping over channels and keeping peak RAM usage around 2 or 4 GB, with the tradeoff of more reads/writes to disk. In practice you'll often decimate to keep less than 200 GB on disk, but you could also use the results right away rather than storing them, which is practical when the analysis can be done for each channel independently.
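As a rough sketch of that channel loop (assumed inputs: `epochs_data` shaped (n_epochs, n_channels, n_times), `sfreq`, and `freqs`; the on-disk handling is left out), something like this keeps only one channel's trial-resolved TFR in memory at a time:

```python
from mne.time_frequency import tfr_array_morlet

def tfr_by_channel(epochs_data, sfreq, freqs, n_cycles=7.0):
    """Hypothetical generator: one channel's trial-resolved TFR at a time."""
    n_epochs, n_channels, n_times = epochs_data.shape
    for ch in range(n_channels):
        one_ch = epochs_data[:, [ch], :]  # keep the 3D shape tfr_array_morlet expects
        power = tfr_array_morlet(one_ch, sfreq=sfreq, freqs=freqs,
                                 n_cycles=n_cycles, output='power')
        # peak RAM is ~n_freqs copies of one channel, not of the whole recording
        yield ch, power[:, 0]  # (n_epochs, n_freqs, n_times)
```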
Let me chime in as well with a few assorted thoughts:
Background: We touched upon this subject coming from the visualization side (e.g., visualizing space, frequency, and time interactively for source projections), and then extended our musings towards other use-cases (e.g. trials).
Regarding visualization, an array of data would make more sense than a list/generator. One possibility could thus be to assemble that array immediately before or during the call to the viz GUI from the list/generator. This would leave alone any methods that return such objects (e.g. `apply_dics`).
However, so we thought, it could also make sense to change the class such methods return - from my experience, the lists are not great to work with, need more memory than necessary (due to redundancy), and usually have to be manually transferred into an array for further usage (e.g., decoding or so) anyway.
Now, I tend to agree with @drammock that (at least for now) we could assume that the user has enough memory to do such operations. Alternatively, we could leave the methods that work across trials alone - and let them return a generator (or have the option to let them return a generator), which would fix part of this issue?
From the (GSOC) visualization perspective, we should probably just focus on making sure that we use future-proof input and I think for that @drammock can give us useful hints - so that we can also make sure that sensor and source space objects behave similarly.
> However, so we thought, it could also make sense to change the class such methods return - from my experience, the lists are not great to work with, need more memory than necessary (due to redundancy), and usually have to be manually transferred into an array for further usage (e.g., decoding or so) anyway.
all good points. For the time-frequency visualization use case, there is no question in my mind that a new class is needed (it's already on the roadmap). But I'm -0.5 on also trying to tackle the across-trials interactive visualization problem at the same time.
@drammock Yes, I see your concerns and I think it'd be great to focus the effort as much as possible. It would still be great to write the visualization while already anticipating coming changes to the input (i.e. `stc`) though - to make upcoming adaptations as smooth as possible.
@alexrockhill maybe the version where we have a wrapper from the current-state `stc` to an array could work - right now we only take this input, but once the new class is there, we can extend that to work with a higher-dimensional `stc` as well.
> It would still be great to write the visualization while already anticipating coming changes to the input (i.e. `stc`) though - to make upcoming adaptations as smooth as possible.
yes, definitely. I haven't finalized a source space TFR object's methods/properties yet, but based on existing sensor-space TFR objects you can expect `.times`, `.freqs`, either `.data` or (more likely) `.get_data(times, freqs)`, `.vertices`, `.shape`, `.subject`, `.method` (e.g. `'morlet'`), `.crop(tmin, tmax, fmin, fmax)` and `.save(fname)`. Data shape will likely be vertices x frequencies x times. All this is open to discussion of course, if changing something will make the viz much easier.
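A minimal mock of such a container (purely illustrative, not an MNE class; the attribute set follows the list above and the `get_data` signature is a guess) could be enough to start wiring up the GUI:

```python
import numpy as np

class MockSourceTFR:
    """Hypothetical stand-in: data shaped (n_vertices, n_freqs, n_times)."""

    def __init__(self, data, vertices, times, freqs, subject=None, method='morlet'):
        self.data = np.asarray(data)
        self.vertices = vertices  # e.g. [lh_vertno, rh_vertno], as in SourceEstimate
        self.times = np.asarray(times)
        self.freqs = np.asarray(freqs)
        self.subject = subject
        self.method = method

    @property
    def shape(self):
        return self.data.shape

    def get_data(self, tmin=None, tmax=None, fmin=None, fmax=None):
        """Return a (vertices, freqs, times) slab restricted to the given ranges."""
        def _mask(x, lo, hi):
            lo = -np.inf if lo is None else lo
            hi = np.inf if hi is None else hi
            return (x >= lo) & (x <= hi)
        return self.data[:, _mask(self.freqs, fmin, fmax)][:, :, _mask(self.times, tmin, tmax)]
```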
> I guess my gut reaction is that if the data is too large to generate the spectrum/spectrogram in-memory, then users should consider either (1) a lower-density source space, or (2) fewer timepoints/frequency bins (AKA downsample / decimate). [...] As far as across trials, we have some support for a `return_generator` kwarg (e.g., STCs from Epochs); my instinct would be to keep supporting something like that.
I thought about what you said more here @drammock, and perhaps you are right that this is a more reasonable place to start, given that we put in the examples and tutorials how to concatenate over time. Whether you want to loop over channels or over chunks of data in time, I think both should be supported, as I can see use cases for either (e.g. looping over channels might suit something like sEEG, where stats might be contact-level because it's difficult to compare across groups with different montages, whereas an MEG/EEG analysis would probably use the same sensor set/montage, so with lots of data the time chunks would be more appropriate). Ideally, both approaches would be reasonable ways to produce a source-space result; I think we're a ways away, but maybe we can get there this summer. Thanks for the feedback, I'll try to get started on this implementation probably early next week once https://github.com/mne-tools/mne-python/pull/10777 and a followup to add the basic GUI merge.
Per conversation on Discord, it sounds like everyone is on board with making an stc class that would be returned by applying a DICS beamformer to epochs, iterating over time windows (but computing the covariance using the whole epoch, per @britta-wstnr's point), with times at the center of each window. I think it would be better to make it 5D (3D vertices x frequencies x times) rather than a list of `stc`s (which, as @britta-wstnr suggested, would be the minimal possible change), because it would be compatible with other methods, e.g. LCMV- and MNE-type methods, and because it's more elegant. The one thing is that it equates a binned time window with a single central time, but I don't think that's really all that misleading since there are all sorts of operations that do similar things.
> I think it would be better to make it 5D (3D vertices x frequencies x times)
This would not be consistent with the current STC class, where "space" is really "vertex number" and only uses 1D.
Also maybe I'm just misunderstanding your text above, but it seems different from how I remember the conversation. My memory is "mock up a 3D STC class (i.e. make a one-off with whatever init signature is convenient, and give it only the methods needed for the GUI testing), then instantiate it and use it to test your WIP GUI. No need to make the mocked class actually be returned by any of our existing functions yet." Please correct me if I'm mis-remembering @britta-wstnr @agramfort
Yes. A 1D axis for space, with vertno for indexing into the source space, is what is consistent.
And yes, mock an object just to make the GUI doable. We will see next how to adjust the public functions to return such an object. You can use a function that converts the list of lists of stc to this new object for testing on real data.
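A hedged sketch of that converter (the nesting order is an assumption: outer list over frequencies, inner list over time windows with one sample each; swap the loops if the real output is nested the other way around):

```python
import numpy as np

def stc_lists_to_array(stcs_per_freq, freqs):
    """Hypothetical helper: stack a list (freqs) of lists (windows) of STCs."""
    first = stcs_per_freq[0][0]
    vertices = first.vertices                                 # [lh_vertno, rh_vertno]
    times = np.array([stc.tmin for stc in stcs_per_freq[0]])  # window times
    data = np.stack(
        [np.concatenate([stc.data for stc in stcs_one_freq], axis=1)
         for stcs_one_freq in stcs_per_freq],
        axis=1,
    )                                                         # (n_vertices, n_freqs, n_times)
    return data, vertices, times, np.asarray(freqs)
```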
Yup, I agree with @drammock and @agramfort. Also that it is only in principle 5D, but a 3D object.
Talking to @britta-wstnr about the options for time-frequency resolved source estimates (and also trial-resolved source estimation ideally for forward extensibility) here is what we came up with:
Here is the problem:

- `stc` attributes all match every time
- `stc` data object would be the easiest but would break backward compatibility for people using the public attribute `stc.data`
Here is the proposed solution:
Avoiding potential issues:
We couldn't think of any other future extension dimensions that would be good to include, but since this proposal is to make new classes, if there are candidates we could consider including them now. Connectivity comes to mind, but I think that should stay outside the main MNE-Python repository. TFR seems squarely in the main MNE-Python repository's wheelhouse, so I think this would be nice to incorporate so that the visualization GUI can take in an object with all the necessary info and the user doesn't have to cobble it together. What do people think?
cc @agramfort @larsoner @jasmainak related to GSoC