mne-tools / mne-python

MNE: Magnetoencephalography (MEG) and Electroencephalography (EEG) in Python
https://mne.tools
BSD 3-Clause "New" or "Revised" License

Resampling on non-preloaded data #4398

Closed: skjerns closed this issue 7 years ago

skjerns commented 7 years ago

According to http://martinos.org/mne/dev/auto_examples/preprocessing/plot_resample.html this should be possible:

When resampling epochs is unwanted or impossible, for example when the data doesn’t fit into memory or your analysis pipeline doesn’t involve epochs at all, the alternative approach is to resample the continuous data. This can also be done on non-preloaded data

raw = mne.io.read_raw_brainvision('somefile.vhdr')
raw.copy().resample(100)

gave me: raw.resample requires raw data to be loaded. Use preload=True (or string) in the constructor or raw.load_data().

This is related to #4381: I would like to resample only certain channels, without loading the whole data file. If there are any ideas on how to do that in MNE I would be grateful :) (right now I'm resampling manually with scipy.signal, but I trust the MNE routines more on EEG data). A second issue (known, as I see from #1814) is that even with the data preloaded and all unnecessary channels dropped, it has been running for more than 30 minutes (is it stuck in a loop?), while with scipy.signal it takes ~5 seconds.
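
Roughly what I'm doing with scipy.signal right now (a sketch; the 128 Hz input rate and the random test data are just for illustration):

import numpy as np
from scipy import signal

# One hypothetical channel at 128 Hz; in practice this would be a single
# channel read out of the recording.
sfreq_in = 128
x = np.random.randn(8 * 3600 * sfreq_in)  # ~8 h of samples

# Polyphase resampling from 128 Hz to 100 Hz; 100/128 reduces to 25/32.
y = signal.resample_poly(x, up=25, down=32)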

agramfort commented 7 years ago

See that the example loads the data:

raw = mne.io.read_raw_fif(raw_fname).crop(120, 240).load_data()

(read_raw_fif docs: http://martinos.org/mne/dev/generated/mne.io.read_raw_fif.html#mne.io.read_raw_fif)
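
The same pattern applied to the BrainVision file from the snippet above would be (using the placeholder filename from there):

# Load the data into memory first, then resample.
raw = mne.io.read_raw_brainvision('somefile.vhdr').load_data()
raw.resample(100)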

skjerns commented 7 years ago

Yes, but then the documentation is wrong?

Then the second part of my issue comes into play: it seems to have been stuck in a loop for around 30 minutes. Two processes seem to be running at full CPU usage (although I set n_jobs=4).

larsoner commented 7 years ago

Yeah that documentation appears to be incorrect.

Are you using npad='auto'? What are len(raw.times) and raw.info['sfreq']? It's possible that the frequency-domain resampling is suboptimal here, which might be a good case for (and reason to implement) polyphase resampling.

skjerns commented 7 years ago

Are you using npad='auto'? What are len(raw.times) and raw.info['sfreq']?

I didn't touch npad, so I think it should default to 'auto'. len(raw.times) is 5157120 (8 h of sleep recordings) and raw.info['sfreq'] is 127.99.

(Yes, I know it's a very long recording; that might explain the problems with the FFT resampling.)

larsoner commented 7 years ago

Hmmm... how long does raw.copy().filter(None, 10., picks=[0]) (assuming you are on latest master) take? That will give some idea of how long the polyphase resampling would take for a single channel.

larsoner commented 7 years ago

And I'm curious why you need to resample to 100 Hz? It doesn't seem like it would affect the analyses too much.

skjerns commented 7 years ago

raw.copy().filter(None, 10., picks=[0]) takes 0.8 seconds. (I'm on 0.15.dev0, which should be the latest dev?) I am feeding the data to a neural network and want to standardize the input to 100 Hz.

larsoner commented 7 years ago

Can you try this code? It takes 1.24 sec on my machine:

import mne
import numpy as np
import time

n_samples = 5157120
n_chan = 1
raw = mne.io.RawArray(np.random.randn(n_chan, n_samples),
                      mne.create_info(n_chan, 127.99, 'eeg'))
t0 = time.time()
raw.resample(100.)
print(time.time() - t0)

It suggests that resampling this data would take about 1.24 * len(raw.ch_names) seconds (how many channels are there?).

skjerns commented 7 years ago

1.8 sec for me. I am using a subset of 3 channels, so it should not take much longer.

I figured out that it's a problem with using the original header:

import time

import mne

raw = mne.io.read_raw_brainvision('file.vhdr', preload=True)

# Copying the data into a fresh RawArray with a new info works fine...
data, _ = raw[[0, 1, 2], :]
new_raw = mne.io.RawArray(data, mne.create_info(3, 127.99, 'eeg'))

t0 = time.time()
new_raw.resample(100.)
print(time.time() - t0)
# takes a few seconds

# ...but resampling the same channels with the original header does not.
raw.pick_channels(['C3', 'LOC', 'EMG1'])
t0 = time.time()
raw.resample(100.)
print(time.time() - t0)
# gets stuck

I think this workaround should do for me :) thanks!

larsoner commented 7 years ago

Is it possible for you to share the file privately (or another similarly failing file)? It would be good to understand what in MNE broke it.

skjerns commented 7 years ago

I've sent you a mail. Thank you very much for looking into this!

larsoner commented 7 years ago

So it turns out I can replicate with my RawArray example above using sfreq=127.99180852425445 (the actual sample rate of your data). This results in a much less favorable irfft length than 127.99 does.
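
That is, roughly:

import time

import mne
import numpy as np

n_samples = 5157120
n_chan = 1
# Same test as before, but with the file's exact sample rate; the resulting
# irfft length is far less favorable, so this runs much slower.
raw = mne.io.RawArray(np.random.randn(n_chan, n_samples),
                      mne.create_info(n_chan, 127.99180852425445, 'eeg'))
t0 = time.time()
raw.resample(100.)
print(time.time() - t0)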

There might be some way to choose the padding so that favorable lengths are available for both the rfft and irfft. I can't think of one offhand, but someone can try if they want. In the meantime, you can also hack a fix into your data by doing this instead of using RawArray:

raw.info['sfreq'] = 127.99

So you can accept some inaccuracy/drift in exchange for faster computation. Polyphase filtering, for what it's worth, relies on integer ratios, which would also only approximate this sample rate, so it wouldn't fix this problem. (And since polyphase filtering usually tolerates some inaccuracy anyway, maybe choosing a close but not exactly equivalent sample rate is acceptable here.)
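
To illustrate the integer-ratio point (purely illustrative, using Python's fractions module):

from fractions import Fraction

true_sfreq = 127.99180852425445
# 100/128 reduces to a clean polyphase up/down ratio...
print(Fraction(100, 128))  # 25/32
# ...whereas the true rate admits no small-integer ratio, only approximations.
print(Fraction(100 / true_sfreq).limit_denominator(1000))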

Let's leave this open as a reminder to fix the tutorial wording, though.

skjerns commented 7 years ago

Ah! That makes sense. Thanks!

Could it be that the original sample rate was 128, but due to conversion to/from float it got saved as 127.99180852425445? I think I will just use np.round() to avoid problems with other odd sample rates in the future as well, or do you recommend staying with 2 decimals?

larsoner commented 7 years ago

Could it be that the original sample rate was 128 but due to conversion to/from float

This looks like too much precision loss for this to be the case, at least for 32-bit float.

With rounding to 128 you'd be off by ~0.008 Hz instead of the ~0.002 Hz for using 127.99. Assuming the reported sample rate is correct, your recording was 40292.57856 seconds long, whereas at 128 Hz it would appear to be exactly 40290 seconds long. (FWIW, the fact that 128 divides evenly into the number of samples also suggests that 128 might be the true sample rate, because data are often stored in 1-second buffers.) So perhaps some conversion routine to BrainVision format mis-reports the sample rate?
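
A quick check of that arithmetic:

n_samples = 5157120
print(n_samples / 127.99180852425445)  # ~40292.58 s at the reported rate
print(n_samples / 128)                 # 40290.0 s exactly at 128 Hz
print(n_samples % 128)                 # 0, consistent with 1-second buffers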