mne-tools / mne-python

MNE: Magnetoencephalography (MEG) and Electroencephalography (EEG) in Python
https://mne.tools
BSD 3-Clause "New" or "Revised" License
2.7k stars 1.31k forks source link

When loading EDF+ with different sampling frequencies, don't resample blockwise #10635

Closed skjerns closed 1 year ago

skjerns commented 2 years ago

Describe the new feature or enhancement

It is possible to load EDF+ files with different sampling frequencies. To my understanding, internally, the channels with differing sampling frequencies are resampled. I understand that the underlying data format BaseRAW assumes that all channels have the same sampling frequency, so resampling the signals to a common denominator is strictly necessary.

Describe your proposed implementation

However, this resampling is happening per block, i.e. every 10 MB of data, and therefore introduces edge artefacts (as also noted in the comments). These are mostly minor, however, in some of my calculations they do indeed change the results of our analysis.

https://github.com/mne-tools/mne-python/blob/549de120926465a514ac5d7b0b8950f49ec2ade5/mne/io/edf/edf.py#L330=

Describe possible alternatives

I think it would be good to load all channels into different data buffers, and only resample them after loading the entire file. However, maybe there was a reasoning why the resampling takes place on a per-block basis (e.g. efficiency?), so please let me know if it would be worth implementing such a change, or if I could give it a try to refactor the function to resample the data after loading everything.

agramfort commented 2 years ago

oh waouh.... to me resampling by buffer is really a bad idea. I have not realized it was done like that.

I don't we can address this issue very rapidly. Can you go around the pb with your analysis eg by exclude the channels with a higher sampling rate?

Message ID: @.***>

cbrnr commented 2 years ago

We should definitely resample the whole signal to avoid edge artifacts. Maybe the blockwise implementation is used to support on-demand reading of specific data segments?

skjerns commented 2 years ago

oh waouh.... to me resampling by buffer is really a bad idea. I have not realized it was done like that. I don't we can address this issue very rapidly. Can you go around the pb with your analysis eg by exclude the channels with a higher sampling rate?

Unfortunately this is not possible in my case, as by accident Pz and M1/M2 were recorded with different sampling frequencies. However, I can resort to loading everything with pyedflib and putting it manually into a RawArray

btw: what do you abbreviate with pb? never seen it before and google doesnt help me

skjerns commented 2 years ago

Workaround:

from pyedflib import highlevel
import numpy as np
import mne

def resample(data, o_sfreq, t_sfreq):
    """
    resample a signal using MNE resample functions
    This automatically is optimized for EEG applying filters etc

    :param raw:     a 1D data array
    :param o_sfreq: the original sampling frequency
    :param t_sfreq: the target sampling frequency
    :returns: the resampled signal
    """
    if o_sfreq==t_sfreq: return data
    raw = np.atleast_2d(data)
    ch_names=['ch{}'.format(i) for i in range(len(raw))]
    info = mne.create_info(ch_names=ch_names, sfreq=o_sfreq, ch_types=['eeg'])
    raw_mne = mne.io.RawArray(raw, info, verbose='WARNING')
    resampled = raw_mne.resample(t_sfreq, n_jobs=-1, verbose='WARNING')
    new_raw = resampled.get_data().squeeze()
    return new_raw.astype(raw.dtype, copy=False)

def load_edf(edf_file, channels):
    """
    Parameters
    ----------
    edf_file : str
        edf file name.
    channels : list of str
        list of channel names to load.

    Returns
    -------
    raw : TYPE
        DESCRIPTION.
    convenience function to load an EDF+ without MNE.
    Workaround for https://github.com/mne-tools/mne-python/issues/10635"""

    if not isinstance(channels, list):
        channels = [channels]

    sigs, sigheads, header = highlevel.read_edf(edf_file, 
                                               ch_names=channels)

    sfreqs = [s['sample_rate'] for s in sigheads]
    labels = [s['label'] for s in sigheads]

    if len(set(sfreqs))>1:
        sigs = [resample(x, sfreqs[i], min(sfreqs)) for i, x in enumerate(sigs)]

    info = mne.create_info(labels, min(sfreqs), ch_types='eeg')

    raw = mne.io.RawArray(sigs, info)

    return raw
cbrnr commented 2 years ago

@skjerns I think it would be great if you added this code snippet to our docs. Maybe a short note about the problem and your proposed workaround would fit here?

cbrnr commented 2 years ago

Oh, and pb = problem.

skjerns commented 2 years ago

I can do that.

cbrnr commented 1 year ago

@skjerns would you still be up to implement a proper solution, i.e. resampling of the entire signal? The current block-wise resampling is highly problematic to the point where I'd rather not do it at all (and let users only load channel groups with the same sampling frequency to avoid block-wise resampling altogether).

skjerns commented 1 year ago

I can probably have a look at it after some deadlines have past (~end the end of february)

skjerns commented 1 year ago

@cbrnr I'm struggeling a bit to implement it in a proper way.

Currently I have a working implementation that loads the data and upsamples to the highest sampling frequency in the file of all requested channels by loading all data into lists and upsampling once in the end. _mult_cal_one is called all in the end after all data has been upsampled. However, this process also resamples when accessing only a single channel when not preloading e.g.

raw = mne.io.read_raw_edf('diff-sfreq.edf') # file containing 10, 20 and 30 Hz, no preloading
data = raw.get_data(f'10hz}') # result will be upsampled to 30 Hz

raw = mne.io.read_raw_edf('diff-sfreq.edf', exclude=['30hz']) # file containing 10, 20 and 30 Hz, no preloading
data = raw.get_data(f'10hz}') # result will be upsampled to 20 Hz

The problem is that data is given to the _read_segment_file as a view and is already expecting 30 Hz, there is not much I can do about that without changing BaseRaw, as far as I can see?

How would you say to proceed?

Solutions:

Besides that it seems to work, just some tests throw errors for small deviations in values when loaded on the fly or the entire signal at once when compared to loading the data from smaller segments. I need to look into how to fix that (i.e. test_edf_others). It might be that the "new" values are correct, and the old ones are wrong due to edge artifacts. Any idea how to address this if it happens, without tempering with other files in the repository too much (the test is defined in base/tests)?

agramfort commented 1 year ago

@cbrnr https://github.com/cbrnr I'm struggeling a bit to implement it in a proper way.

Currently I have a working implementation that loads the data and upsamples to the highest sampling frequency in the file. However, this also resamples when accessing only a single channel e.g.

raw = mne.io.read_raw_edf('diff-sfreq.edf') # file containing 10, 20 and 30 Hzdata = raw.get_data(f'10hz}') # result will be upsampled to 30 Hz

yes it's the expected behavior.

you avoid this you need to pick the channels you want in the init. Once you created the raw instance you have defined the info["sfreq"] which is not channel specific.

Message ID: @.***>

skjerns commented 1 year ago

@cbrnr https://github.com/cbrnr I'm struggeling a bit to implement it in a proper way. Currently I have a working implementation that loads the data and upsamples to the highest sampling frequency in the file. However, this also resamples when accessing only a single channel e.g. raw = mne.io.read_raw_edf('diff-sfreq.edf') # file containing 10, 20 and 30 Hzdata = raw.get_data(f'10hz}') # result will be upsampled to 30 Hz yes it's the expected behavior. you avoid this you need to pick the channels you want in the init. Once you created the raw instance you have defined the info["sfreq"] which is not channel specific.

ok, thanks!

Now the only problem remaining is one tests fails when preload=False and specific timeframes are accessed. I presume this is due to edge artifacts, as it only happens for channels which were resampled:

this test: https://github.com/mne-tools/mne-python/blob/bf2502166eb15626c1205accc2d2d467535b8d93/mne/io/edf/tests/test_edf.py#L177-L182

which calls this test in io/tests/test_raw.py https://github.com/mne-tools/mne-python/blob/bf2502166eb15626c1205accc2d2d467535b8d93/mne/io/tests/test_raw.py#L120-L125

Plotting differences between the loaded signals reveals A9 having clear edge artefacts, as it is indeed the only resampled channel test_reduced edf-1

However, some other time indices look weird and not like clear edge artifacts:

test_reduced edf-5 test_reduced edf-7

Any idea what to do with this? Simply remove the test?

agramfort commented 1 year ago

@skjerns where can I see your code? Did I miss the pull request?

skjerns commented 1 year ago

@agramfort sure: https://github.com/mne-tools/mne-python/pull/11549

cbrnr commented 1 year ago

Did I understand you correctly that with preload=True everything already works as expected? If so, preload=False triggers resampling when data is accessed on-the-fly, so it shouldn't really change that much (except if the previous block size is different from the current one). I mean, the idea to perform resampling on the whole signal only makes sense for preload=True, so do we even have to support the preload=False case? What if we leave that as is and output a warning if some (block-wise) resampling is going on?