Calling get_data() on preloaded Epochs returns wrong units

pablomainar commented 1 year ago

Description of the problem

Hello,

When creating an Epochs object with the argument preloaded set to True, the get_data method does not work as expected when passing a unit that is not volts. It will work well the first time, but subsequent calls will each time return data that is multiplied by the unit factor. Please note that this only happens when the preload argument is passed as True due to the if statement of line 1670 of epochs.py.

Steps to reproduce

# Create fake EEG Epochs: All channels are a constant value of 1uV.
eeg = np.ones((8, 2500)) * 1e-6
info = mne.create_info(ch_names=[str(i) for i in range(1,9)], sfreq=250, ch_types="eeg")
raw = mne.io.RawArray(eeg, info)
epochs = mne.make_fixed_length_epochs(raw, duration=0.5, overlap=0, preload=True)

# Get the data for the first time: Works well
data = epochs.get_data(units="uV")
print(data[0,0,0]) # This will correctly print the first sample of the first channel of the first epoch in uV (1 uV)

# Get the data for the second time: Unexpected result
data = epochs.get_data(units="uV")
print(data[0,0,0]) # This will return the value from before, but multiplied by 1e6, which is unexpected (100000 uV)

Link to data

No response

Expected results

Subsequent calls to epochs.get_data(units="uV") should always return the same value, since the underlying data is not supposed to be modified. In the previous code, this value should be 1uV.

Actual results

After the first call to epochs.get_data(units="uV") the underlying data is multiplied by 10^6 (if the unit is uV) each time that get_data is called, therefore giving different results each time.

Additional information

Platform macOS-10.16-x86_64-i386-64bit Python 3.8.13 (default, Mar 28 2022, 06:16:26) [Clang 12.0.0 ] Executable /Users/pmainar/miniconda3/envs/xdf/bin/python CPU i386 (8 cores) Memory 16.0 GB

Core ├☑ mne 1.5.0 ├☑ numpy 1.22.4 (OpenBLAS 0.3.18 with 4 threads) ├☑ scipy 1.9.0 ├☑ matplotlib 3.7.0 (backend=module://ipympl.backend_nbagg) ├☑ pooch 1.6.0 └☑ jinja2 3.1.2

Numerical (optional) ├☑ sklearn 1.0.2 ├☑ numba 0.57.1 ├☑ pandas 1.4.2 └☐ unavailable nibabel, nilearn, dipy, openmeeg, cupy

Visualization (optional) ├☑ qtpy 2.2.0 (None=None) ├☑ ipympl 0.9.3 └☐ unavailable pyvista, pyvistaqt, ipyvtklink, vtk, pyqtgraph, mne-qt-browser

Ecosystem (optional) └☐ unavailable mne-bids, mne-nirs, mne-features, mne-connectivity, mne-icalabel, mne-bids-pipeline

welcome[bot] commented 1 year ago

Hello! 👋 Thanks for opening your first issue here! ❤️ We will try to get back to you soon. 🚴

larsoner commented 1 year ago

Yes I think this is a bug, get_data should probably always return a copy of the data not a view. @pablomainar would you be up for trying to fix this?

larsoner commented 1 year ago

(or if we already do return a copy, make sure we make the copy before applying the scaling)

cbrnr commented 1 year ago

Maybe return a copy only if units are changed, and return a view if not? Otherwise, we're forcing unnecessary and slow copy operations on users, which might negatively impact performance.

pablomainar commented 1 year ago

So the problem comes from the code in lines 1670 to 1678 from epochs.py.

if self.preload:
      data = data[select]
      if orig_picks is not None:
          data = data[:, picks]
      if units is not None:
          data *= ch_factors[:, np.newaxis]
      if start != 0 or stop != self.times.size:
          data = data[..., start:stop]
      return data

data is a view of self._data. So slicing with select, picks or start/stop creates a copy but multipliyng by ch_factors is done inplace, therefore changing the underlying self._data.

As @cbrnr mentions it is probably more efficient to simply modify the way the multiplication is done, and return a copy in that case. I will create a PR with a fix asap.

larsoner commented 1 year ago

As @cbrnr mentions it is probably more efficient to simply modify the way the multiplication is done, and return a copy in that case

I think correctness/safety outweighs efficiency here. Returning a view sometimes and a copy others is dangerous and can lead to silent and hard to diagnose (or even see the symptoms of in the first place) bugs. A use could very easily .get_data, do an in place operation on the array, and now their Epochs are silently broken.

If we really want to return a view sometimes I think we'd have to add a copy=True (default) that people can set to false to get non-copied data when it's possible to get it.

larsoner commented 1 year ago

Argh actually it looks like historically (for at least 10 years!) we returned a view / non-copy, e.g.:

https://github.com/mne-tools/mne-python/blob/265b9bfdd43ba571fca1e7271b0481a2ae2eb903/mne/epochs.py#L444

So I guess we should live with it. Looking at https://mne.tools/stable/generated/mne.Epochs.html#mne.Epochs.get_data I don't see any warning about this so @pablomainar would you be up for adding the minimal fix (in the case of unit scaling, add a copy) and also adding a .. warning:: to epochs.get_data that says sometimes the result will be a view, which means that modifying it in-place will modify the data in Epochs itself to be incorrect?

cbrnr commented 1 year ago

Correctness/safety outweighs efficiency, but history trumps everything 🤣!

drammock commented 1 year ago

Argh actually it looks like historically (for at least 10 years!) we returned a view / non-copy, [...] So I guess we should live with it.

This doesn't make sense to me. To me it seems like a bug that we sometimes return a view, and we should fix it (even if the bug is subtle / rarely causes problems / is very old). Is the worry that some users might have figured out that we return a view, and made use of that knowledge to modify the Epochs data intentionally? That seems even less likely than the case of modifications happening unknowingly / by accident.

cbrnr commented 1 year ago

This is clearly documented:

Returns data : array of shape (n_epochs, n_channels, n_times) A view on epochs data.

cbrnr commented 1 year ago

But this is still inconsistent, Raw.get_data() always returns a copy, whereas Evoked.get_data() returns a view, just like Epochs.get_data(). At least according to their docstrings.

drammock commented 1 year ago

This is clearly documented

ah, ok. in that case I feel less strongly that we should change it right now.

But this is still inconsistent, Raw.get_data() always returns a copy, whereas Evoked.get_data() returns a view, just like Epochs.get_data(). At least according to their docstrings.

agree that it's inconsistent. I wouldn't mind changing all of them to have a copy param and migrating to copy=True as the default for all 3 objects.

cbrnr commented 1 year ago

agree that it's inconsistent. I wouldn't mind changing all of them to have a copy param and migrating to copy=True as the default for all 3 objects.

👍

cbrnr commented 1 year ago

Although in some cases, I don't know if it will be that easy to switch from the current copy behavior to a view. After all, NumPy fancy indexing always produces a copy, and we're using that quite often (just not in the case that was reported here).

larsoner commented 1 year ago

I think it's okay to have copy=True mean "always return a copy" and copy=False mean "return a view if possible, and use at your own risk"

cbrnr commented 1 year ago

I think it's okay to have copy=True mean "always return a copy" and copy=False mean "return a view if possible, and use at your own risk"

But in the latter case, I think we should really warn ("you requested a view, but this is not possible with your combination of arguments, so you're getting a copy instead").

larsoner commented 1 year ago

Okay with me

pablomainar commented 1 year ago

Do you think it actually makes sense to return a view at all?

(1) The output of get_data() should never be used to modify the Epochs underlying data. If the user wants to do this, there are other ways. (2) Inside the _get_data() method a variable "data" is currently created as a view of self._data. So it is also dangerous that future updates to the method will end up mistakenly modifying data and therefore self._data. I think self._data should never be modified inside _get_data() and the best way of avoiding it is to create a copy from the very beginning of the method. (3) Even if we want to return a view in some cases for performance reasons, the number of scenarios that we are able to do it is limited. As soon as they ask for a slice in any axis, _get_data() must return a copy.

If we have a copy argument, we might leave (1) for the user with a warning, which might be fine. But (2) can give problems in future versions, as we cannot convert "data" into a copy from the very beginning.

drammock commented 1 year ago

Do you think it actually makes sense to return a view at all?

see @larsoner's comment here:

I think the most common case here is probably for the user "I don't care if it's a copy or not just make it as efficient as possible" often for passing to a sklearn classifier

In other words, I think the two reasonable use cases are

"I'm not going to modify the array anyway I just need the numbers as fast as possible" and
"I might modify the array but I don't need to interact with the Epochs object again once I have a handle for its data"

mne-tools / mne-python