mne-tools / mne-python

MNE: Magnetoencephalography (MEG) and Electroencephalography (EEG) in Python
https://mne.tools

Errors when importing EDF+ data #1757

Closed: choldgraf closed this issue 9 years ago

choldgraf commented 9 years ago

I'm running into some strange behavior when importing EDF+ data. It seems like there's something going wrong with timing and sampling rate. For example, when I try to import my EDF+ file, I get:

RuntimeError: Channels contain different sampling rates. Must set preload=True

And trying to preload the data returns a memory error.

I've used EDFBrowser to convert the data, and it displays fine in there. I've also confirmed that the data displays fine with EEGLab, and the sampling rate for all channels should be the same. I'm converting these over from Nihon Kohden format (using EDFBrowser).

Any ideas on how I should try to troubleshoot this?

larsoner commented 9 years ago

If I understand the problem correctly, the best solution would be to implement a native reader for the Nihon Kohden format (I've never heard of it, though...) in mne-python, so that you're not converting the file format twice.

Short of doing that, you (or someone) will need to look at why the sample rates seem to be different. When you hit the error, you can do:

import pdb
pdb.pm()

This puts you in post-mortem debugging -- right at the stack level of the error in mne/io/edf/edf.py. That should be the same location that holds the variable with the differing sampling rates, and using standard pdb commands like p to print, and u or d to move up and down the stack, you can investigate the issue. My guess is that the stim channels might be at a different sampling rate -- I think someone has had that problem before...
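
For example, an interactive session might look like this (just a sketch -- the file name is made up, and n_samps is the internal variable discussed below):

>>> import mne
>>> raw = mne.io.read_raw_edf('my_file.edf')  # raises the RuntimeError
>>> import pdb
>>> pdb.pm()           # post-mortem: drops you at the raise site
(Pdb) p n_samps        # print the per-channel samples-per-record vector
(Pdb) u                # move up one frame in the stack
(Pdb) d                # move back down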

choldgraf commented 9 years ago

Yup - it looks like the final channel is some sort of "event marker". I'm trying to figure out if there's a way to exclude this from the channels.

Here's the n_samps vector, whose entries must all be the same for the file to work without preloading the data.

array([100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
       100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
       100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
       100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
       100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100,
       100, 100, 100, 100, 100, 100, 100, 100, 100, 100,  27])

I'm not using that channel at all, so perhaps I can just delete it beforehand somehow. I don't know why MNE reads the sampling rates differently from EEGLab or EDFBrowser.
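
One possible workaround, sketched under two assumptions -- that the extra entry is the EDF+ annotations channel, and that your mne-python version's read_raw_edf accepts an exclude argument (the file name and channel label below are guesses; check raw.info['ch_names'] for the real label first):

import mne

# Drop the marker/annotation channel at read time so the remaining
# channels share one sampling rate.
raw = mne.io.read_raw_edf('test_short_sig.edf',
                          exclude=['EDF Annotations'], preload=True)
print(raw.info['sfreq'], raw.ch_names)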

larsoner commented 9 years ago

It's possible those readers just resample all channels to the same rate. It's also possible there is a bug in how mne-python interprets the sample rates. We'd need to figure that part out first before proceeding.

Assuming the sample rates reported by mne-python are correct, it's pretty easy to upsample a stimulus channel, assuming it's just pulses at different heights. You can use zero interpolation in scipy, and the upsampling should be lossless, so that part shouldn't be more than a few lines. The tougher part might be changing the on-the-fly data reading function to read from the correct parts of the file, depending on how they're organized.
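
A minimal sketch of that upsampling, reading "zero interpolation" as a zero-order hold (repeat each sample), which preserves pulse onsets and heights exactly -- plain numpy is enough here:

import numpy as np

# Toy stim trace (made-up values): pulses encoded as sample heights.
stim = np.array([0, 0, 4, 4, 0, 8, 0], dtype=np.int16)

factor = 10  # assumed ratio, e.g. stim at 100 Hz vs. EEG at 1000 Hz
stim_up = np.repeat(stim, factor)  # zero-order hold: lossless for pulses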

Can you make a minimal version of the file that has the same problem, either by truncating it in EDFBrowser or by doing a really short recording with your equipment? It would make debugging easier to have a file that's easy to download and share.

choldgraf commented 9 years ago

I'll try creating a truncated version of the file. I'm new to EDF / EDFBrowser, and it's not trivial to get more EDF data because we're an ECoG lab :P

larsoner commented 9 years ago

The data don't have to be meaningful, so if you are with your equipment, you can record noise (hooked up to nobody). If the amps are at the hospital, then truncation sounds easier :)

choldgraf commented 9 years ago

Hmmm, so I just tried truncating the file down to 4 channels and "printing" to EDF. However, I'm still getting the same error. Interestingly, even though there are 4 channels, there are 5 values in n_samps:

array([100, 100, 100, 100,  43])

This is with all but the first four signal channels removed (no stim signals). It seems like something else is being read out even though it's not supposed to be a channel?
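
For anyone debugging this, here's a sketch that reads the per-signal fields straight from the EDF header, using the fixed offsets from the EDF spec (the file name is made up). The mystery fifth entry should show up here with its label -- typically "EDF Annotations" in an EDF+ file:

with open('test_short_sig.edf', 'rb') as f:
    hdr = f.read(256).decode('ascii')      # fixed-size part of the header
    n_records = int(hdr[236:244])          # number of data records
    record_length = float(hdr[244:252])    # duration of one record, in s
    ns = int(hdr[252:256])                 # number of signals (channels)
    labels = [f.read(16).decode('ascii').strip() for _ in range(ns)]
    f.seek(256 + ns * 216)                 # jump to samples-per-record field
    n_samps = [int(f.read(8).decode('ascii')) for _ in range(ns)]
print(list(zip(labels, n_samps)))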

larsoner commented 9 years ago

That does sound strange. I don't really work with EDF so I probably won't be too helpful... @teonlamont might be able to help since he's done most of the EDF work.

teonbrooks commented 9 years ago

So, as you may have noticed, the Status/Stim channel is sampled at a lower rate than the other channels. The data are stored in records, where each record contains n samples per channel, so the data structure isn't rectangular (see http://www.biosemi.com/faq/file_format.htm for a more detailed explanation). Selecting specific channels before reading them in therefore becomes a nightmare; we have to read everything in and then select afterwards, like the other packages do. That is why preload must be True.
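
A sketch of why that is, assuming ordinary 16-bit EDF samples (the function and variable names here are made up for illustration): channel k's data are interleaved record by record, so extracting one channel still means seeking through the whole file.

import numpy as np

def read_edf_channel(f, header_bytes, n_records, n_samps, k):
    # Each record stores n_samps[j] little-endian int16 samples for every
    # channel j in turn, so one channel is scattered across all records.
    rec_bytes = sum(n_samps) * 2
    chunks = []
    for r in range(n_records):
        f.seek(header_bytes + r * rec_bytes + sum(n_samps[:k]) * 2)
        chunks.append(np.frombuffer(f.read(n_samps[k] * 2), dtype='<i2'))
    return np.concatenate(chunks)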

For the memory error, how big a file are you dealing with? When you tried the smaller file, did it read in fine with preload=True? Also, just to check: is the last channel name Stimulus or Status? You can check with raw.info['ch_names'].

teonbrooks commented 9 years ago

Also, if you have a copy of a file that you could send, it would help with the debugging.

choldgraf commented 9 years ago

Thanks for your thoughts - when I use preload=True on a subset of the data, it loads fine. However, I think I found the reason it kept raising a MemoryError.

It didn't make sense to me that this was happening, because the file is only ~1GB in total, and this is on a cluster with ~64GB of RAM. It is 2 hours of data sampled at 1000Hz with about 60 channels, so about 7,200,000 samples x 60 channels, which isn't so bad.

However, when I loaded in the shortened data, I noticed that it has ten times as much data as it should. I exported a 10 second clip of data from EDFBrowser, but when it gets loaded in, the object has 100 seconds in it, so everything is larger by a factor of 10. The sampling frequency looks correct, so I'm not sure what could be going wrong.
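
A quick way to see that (a sketch; the file name is made up) is to compare the loaded duration against what was exported:

import mne

raw = mne.io.read_raw_edf('test_short_sig.edf', preload=True)
print(raw.n_times / raw.info['sfreq'])  # prints ~100.0 here, should be ~10.0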

I can send a copy of the signal if you like, where should I send it? (it's quite small, only ~65kb)

agramfort commented 9 years ago

Share it via Dropbox

choldgraf commented 9 years ago

Good call: https://www.dropbox.com/s/lphlu1imys1qmta/test_short_sig.edf?dl=0

I think it's anonymized as there wasn't anything in patient_info when I brought it up in MNE, but let me know if that's not true...

teonbrooks commented 9 years ago

I just found the error. The code assumed that the number of samples was the number of records times the sampling frequency; it should be the number of records times the sampling frequency times the record length. Your file has a record length of 0.1 (typically it is 1), so the reader ran past the file's limit by a factor of ten. Patching this now.
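
In numbers, for the shared clip (a sketch; the 10 s / 0.1 s / 1000 Hz figures come from this thread):

n_records = 100        # a 10 s clip stored in 0.1 s records
sfreq = 1000.0         # Hz
record_length = 0.1    # s

wrong = n_records * sfreq                  # 100000 samples -> 100 s, 10x too long
right = n_records * sfreq * record_length  # 10000 samples -> the actual 10 s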

choldgraf commented 9 years ago

Fantastic - let me know when it's pulled and I can try my import again.

agramfort commented 9 years ago

It's merged.