mne-tools / mne-python

MNE: Magnetoencephalography (MEG) and Electroencephalography (EEG) in Python
https://mne.tools
BSD 3-Clause "New" or "Revised" License

ENH? to_hdf5 methods / hdf5 reader (via pandas) #1107

Closed dengemann closed 9 years ago

dengemann commented 10 years ago

It just occurred to me that it should be possible to add a general bidirectional hdf5 io without writing too many lines. We already have the pandas io, and pandas supports pytables/hdf5. I always thought the info might be a problem, but it's not:

`epochs.info['chs'] == pd.DataFrame(epochs.info['chs']).T.to_dict().values()`

-> True

This also works with 'projs', 'dig', 'comps', etc.

So the only thing left to do would be to call our pandas io for the data, do something like the above for channels, projs, dig, and comps, handle the rest of the non-nested flat values, and we're good. On reading, it should be easy to recreate e.g. projs objects from dictionaries. Internally, everything would go into one hdf5 file with keys sharing a common prefix, so that entries from all fields could be gathered on reading.

This would certainly be a useful io, basically allowing the data from an entire analysis to be stored in one file ...
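Roughly, the round trip could look like this (a minimal sketch; the helper names `store_info` / `read_chs` are made up here, and it assumes pandas with PyTables installed):

```python
import pandas as pd

def store_info(info, fname, prefix='sample'):
    # each list-of-dicts field in `info` becomes one DataFrame,
    # all stored in a single HDF5 file under a common prefix
    with pd.HDFStore(fname) as store:
        for key in ('chs', 'projs', 'dig', 'comps'):
            if info.get(key):
                store['%s/%s' % (prefix, key)] = pd.DataFrame(info[key])

def read_chs(fname, prefix='sample'):
    # and back to the original list of dicts
    with pd.HDFStore(fname) as store:
        return store['%s/chs' % prefix].to_dict('records')
```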

dengemann commented 10 years ago

Should be even more straightforward than I thought:

http://pandas.pydata.org/pandas-docs/dev/io.html#io-hdf5

Especially the section on hierarchical keys.
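For reference, the hierarchical keys boil down to path-like store keys, so one file can hold many objects side by side (the file name and shapes below are arbitrary):

```python
import numpy as np
import pandas as pd

store = pd.HDFStore('mne.h5')
# path-like keys group related frames under a common prefix
store['sample/evoked/data'] = pd.DataFrame(np.random.randn(59, 106))
store['sample/epochs/data'] = pd.DataFrame(np.random.randn(60, 59))
print(store.keys())  # ['/sample/epochs/data', '/sample/evoked/data']
store.close()
```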

dengemann commented 10 years ago

Drafting ....

Here's the minimum required to save the main contents of an evoked file:

https://gist.github.com/dengemann/8707623

dengemann commented 10 years ago

Example:

In [137]: store = to_hdf(evoked, 'sample', 'mne.h5')

In [138]: store
Out[138]:
<class 'pandas.io.pytables.HDFStore'>
File path: mne.h5
/sample/acq_pars                   frame        (shape->[1,1])
/sample/acq_stim                   frame        (shape->[1,1])
/sample/bads                       frame        (shape->[1,1])
/sample/buffer_size_sec            frame        (shape->[1,1])
/sample/ch_names                   frame        (shape->[1,1])
/sample/chs                        frame        (shape->[59,13])
/sample/comps                      frame        (shape->[1,1])
/sample/ctf_head_t                 frame        (shape->[1,1])
/sample/data                       frame        (shape->[106,59])
/sample/description                frame        (shape->[1,1])
/sample/dev_ctf_t                  frame        (shape->[1,1])
/sample/dev_head_t                 frame        (shape->[16,3])
/sample/dig                        frame        (shape->[146,4])
/sample/experimenter               frame        (shape->[1,1])
/sample/file_id                    frame        (shape->[2,4])
/sample/filename                   frame        (shape->[1,1])
/sample/filenames                  frame        (shape->[1,1])
/sample/highpass                   frame        (shape->[1,1])
/sample/lowpass                    frame        (shape->[1,1])
/sample/meas_date                  frame        (shape->[2,1])
/sample/meas_id                    frame        (shape->[2,4])
/sample/nchan                      frame        (shape->[1,1])
/sample/orig_blocks                frame        (shape->[5,1])
/sample/orig_fid_str               frame        (shape->[1,1])
/sample/proj_id                    frame        (shape->[1,1])
/sample/proj_name                  frame        (shape->[1,1])
/sample/projs                      frame        (shape->[4,4])
/sample/sfreq                      frame        (shape->[1,1])
mluessi commented 10 years ago

Nice :) but I somehow think that if we want to support hdf5, we should write hdf5 directly (using PyTables) without relying on pandas. Also, what would be the advantages over fiff?

dengemann commented 10 years ago

@mluessi two aspects:

https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py

already uses over 4k lines of code to do a great job. I don't think we can do better in a short time with reasonable effort.

The main asset here is that hdf5 is cross-language, cross-platform, cross-whatever, and easier to customize. Writing custom structures is straightforward and does not involve making a PR and modifying readers / writers. Also, it does not have any file size limitations. I would not see it as a replacement for fiff but as an additional tool for storing custom data and analyses, and as an exchange format. I'm not informed about benchmarks at this point, but I would not be surprised to see pytables + pandas outperforming naive fiff access.

Does that make sense?

mluessi commented 10 years ago

@dengemann I see that the flexibility and efficiency of hdf5 is a potential advantage. However, I think that if we officially support it, we would need to properly document what is stored and how, such that other packages (which may not be written in Python) can rely on it. If we rely on pandas to handle this, then we can't really guarantee consistency in the format, i.e., if pandas decides to change the way things are stored in hdf5, it would break the file format for all tools that don't use pandas to read the files.

Btw: Did anything come out of this workshop on electrophysiology file formats that you attended in December?

dengemann commented 10 years ago

@mluessi I understand your concern. If we really want to add support for it, we want to get it as right as we can. The way outlined here is probably more an extension of our pandas io, which is also interesting, especially given the few lines needed. True hdf5 support would mean that the files can be accessed e.g. from Matlab or any other tool. Yes, the meeting was nice. Let's discuss details during a hangout soon ;-) We should probably open an issue for a hangout and meet with everyone who's interested.

agramfort commented 10 years ago

it's neat but I am not super convinced about the motivation. Matlab can read fif files thanks to mne-matlab. Now if you tell me that you can read/write faster, that files are smaller, etc., then I think we should consider it. Reading pandas HDF5 files from Matlab will require some work, don't you think?

dengemann commented 10 years ago

Honestly, my main motivation is to have a more flexible persistence system. One file to save all stcs or all evokeds in; I don't want 20 times 8 fifs for one particular analysis. An hdf5 io would be a nice addition for situations where flexibility counts. E.g. imagine you make a figure for a paper where you compare different preprocessing techniques over several runs and you want to share stuff with colleagues, ideally via email. And so forth. Matlab is really a secondary aspect to me; actually, it's not important in this context, at least to me.

agramfort commented 10 years ago

ok I like animated discussions on github :)

once we have stcs in fif files we could store stcs from many conditions in one file :)

would you store the forward + inverse + evoked + stc in one large file?

dengemann commented 10 years ago

Yeah, that's the spirit. Maybe one day I want to store everything, another day only some of the things mentioned :-) Certainly I'd like to store data from multiple subjects, e.g. after doing first-level analyses. This proposal is not to replace fiffs but to support fast custom io beyond numpy arrays, where you can basically save and reload mne objects as needed, e.g. save 5 ICA runs without creating a mess and without extending fiff support.

agramfort commented 10 years ago

let's discuss this during the next hangout

Amyunimus commented 10 years ago

Is there a way to read an h5 file into MNE python? Or to convert it to .FIF using python for subsequent analyses?

larsoner commented 10 years ago

HDF5 is a pretty general format, so there isn't a clear-cut way to support reading arbitrary HDF5 files. h5py, pandas, and pytables all facilitate reading HDF5-formatted files (depending on how they were created). If you can get your data into a matrix using these packages, then you can use RawArray (or related classes) to operate on your data in mne-python, including saving the file to FIFF format, sure. But you'd want to be careful about what the channel types etc. were being set to.
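A rough sketch of that path (the file name, dataset key, sampling rate, and channel names are all assumptions here; adapt them to however your file was written):

```python
import h5py
import mne

# read the matrix out of the HDF5 file first
with h5py.File('recording.h5', 'r') as f:
    data = f['data'][:]  # expected shape: (n_channels, n_times)

# then describe the channels so mne knows what it is looking at
ch_names = ['EEG%03d' % (ii + 1) for ii in range(data.shape[0])]
info = mne.create_info(ch_names, sfreq=1000., ch_types='eeg')
raw = mne.io.RawArray(data, info)
raw.save('recording-raw.fif')  # write out as FIFF
```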

dengemann commented 10 years ago

we've had an issue + demo code somewhere ... but indeed it's not easy to provide something general. RawArray and EpochsArray are your friends

Amyunimus commented 10 years ago

Great, thanks. Where is the documentation for RawArray? The mne.io page for it isn't available.

larsoner commented 10 years ago

http://martinos.org/mne/dev/generated/mne.io.RawArray.html#mne.io.RawArray

larsoner commented 10 years ago

You'll probably need to update to the latest master; the location of RawArray has shifted in the last couple of weeks in the dev version.

Amyunimus commented 10 years ago

So, I just updated to the latest version on git, but I'm not seeing the mne.io module:

AttributeError: 'module' object has no attribute 'io'

dengemann commented 10 years ago

how did you install it after updating?

dengemann commented 10 years ago

You should now find create_info in the mne namespace

to read the doc:

mne.create_info?
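For instance (the channel names and sampling rate here are just illustrative):

```python
import mne

# two EEG channels sampled at 250 Hz
info = mne.create_info(ch_names=['EEG001', 'EEG002'], sfreq=250.,
                       ch_types='eeg')
print(info)
```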

Amyunimus commented 10 years ago

I was able to find a cached version of it, but thanks!

I'm having a lot of issues trying to get my h5 file to look as functionally close to the .fif files as possible. All I want to do is simple epoch analyses. I managed to get my data and trigger info into two pandas dataframes, but I can't figure out how to organize these so that all the MNE scripts read it as per usual.

The data is 68 chs x number of time points. The trigger data contains a column with the trigger codes and a column with their timing.

Any advice on how to just get this into MNE? I saw you had a similar issue a couple years ago with another type of 4d file...

dengemann commented 10 years ago

it does not have to be like fiff; that's why we created RawArray. The idea is that you pass data, channel names, and corresponding channel types, and mne will do the rest. Mocking a fiff is unrealistic, it would take forever ...

can you share the dataframes?

what kind of system are your data from?

Amyunimus commented 10 years ago

The RawArray read in the info and data fine, but that's it -- the usual mne functions then did not plot the channels and did not let me sort them into epochs. I'm getting lots of deep errors, so I figured it was a formatting issue.

The data are from a Neuroscan system, originally in .cnt format. After some digging, I finally found a python script to get this into an open format, which gave me the .h5 file.

Here's the original .cnt file and the cnt2h5 script I used.

dengemann commented 10 years ago

ok, you're not supposed to pass a dataframe. RawArray is designed for plain 2d channels-by-times numpy arrays. Good that you bring up this case; we will make it throw errors in the future when something other than a numpy array is passed. I can take a look at the data / script this afternoon. Can you try once more with a numpy array? Also, were you able to construct the info?

Amyunimus commented 10 years ago

RawArray did read the pandas dataframe. I tried it now with numpy arrays, and it read the data. But I'm getting index out of bounds errors that I don't understand when I try to plot it or epoch it.

dengemann commented 10 years ago

The only possible reason is that your info object does not match the data. The input to create_info should be a list of channel names of length n_channels, and likewise for the channel types.

e.g.

```python
ch_names = ['EEG{}'.format(i + 1) for i in range(68)]
ch_types = ['eeg'] * 68
```

the data input should be rows = channels and columns = times. This is btw. not the default for dataframes, which is why it cannot end well if you just pass a df.

agramfort commented 10 years ago

we need an example

"Creating MNE data objects from arrays"

to point people to.

I am opening an issue
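Something along these lines could work as a starting point (a hedged sketch; the sampling rate, scaling, and event samples below are invented for illustration):

```python
import numpy as np
import mne

sfreq = 250.  # assumed sampling rate
data = np.random.randn(68, 10000) * 1e-6  # 68 channels x times, in volts
ch_names = ['EEG%03d' % (ii + 1) for ii in range(68)]
info = mne.create_info(ch_names, sfreq, ch_types='eeg')
raw = mne.io.RawArray(data, info)

# fabricated events: columns are sample, previous value, event id
events = np.array([[1000, 0, 1], [3000, 0, 1], [5000, 0, 2]])
epochs = mne.Epochs(raw, events, event_id=dict(a=1, b=2),
                    tmin=-0.2, tmax=0.5, baseline=(None, 0))
evoked = epochs['a'].average()
```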