mne-tools / mne-python

MNE: Magnetoencephalography (MEG) and Electroencephalography (EEG) in Python
https://mne.tools
BSD 3-Clause "New" or "Revised" License

inst.to_data_frame() should allow exporting the index as datetime64 #10213

Open · agramfort opened this issue 2 years ago

agramfort commented 2 years ago

here is an example script I just had to write for a collaborator:

from pathlib import Path
import pandas as pd
import mne

sample_dir = Path(mne.datasets.sample.data_path())
sample_fname = sample_dir / 'MEG' / 'sample' / 'sample_audvis_raw.fif'

raw = mne.io.read_raw_fif(sample_fname, preload=True)
raw.crop(tmax=10)

df = raw.to_data_frame()
df = df.set_index("time")

# build a datetime index anchored at meas_date, one entry per sample since
# the start of acquisition, then drop the first `first_samp` entries so it
# lines up with the data
index = pd.date_range(start=raw.info['meas_date'],
                      periods=len(df) + raw.first_samp,
                      freq=f'{1e3 / raw.info["sfreq"]:0.6f}ms')
df.index = index[raw.first_samp:]
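
With that workaround, the index ends up as a fixed-frequency DatetimeIndex; a quick sanity check (a sketch, assuming the sample data loaded above):

# the index now carries absolute timestamps and a fixed frequency
print(df.index.freq)   # <1664960 * Nanos> for the sample data's sfreq
print(df.index[0])     # meas_date shifted by roughly first_samp / sfreq seconds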

what I have in mind is that we can do

raw.to_data_frame(time_format='date')

to get the time as datetime64. I also wonder why time is not set as the index by default, but that's more a matter of taste.

@hoechenberger @dengemann @drammock what do you think?

dengemann commented 2 years ago

yes we should support that! no strong feelings on defaults.

hoechenberger commented 2 years ago

a datetime index would certainly make sense.

drammock commented 2 years ago

This is already supported. Quoting the docstring of the time_format parameter:

If 'datetime', time values will be converted to pandas.Timestamp values, relative to raw.info['meas_date'] and offset by raw.first_samp.

Setting time as the index automatically is possible by passing index='time'. If I run your snippet up through raw.crop(tmax=10) and then:

In [5]: raw.to_data_frame(time_format='datetime', index='time')
Out[5]: 
channel                                MEG 0113   MEG 0112    MEG 0111  ...    EEG 059    EEG 060     EOG 061
time                                                                    ...                                  
2002-12-03 19:01:53.676070829+00:00   96.435548 -48.217774  101.074222  ...  38.854217  65.839113  285.661012
2002-12-03 19:01:53.677735789+00:00    0.000000 -28.930664   63.171389  ...  40.751037  68.002565  283.699953
2002-12-03 19:01:53.679400749+00:00    0.000000  -9.643555   75.805667  ...  40.995788  68.177980  280.431520
2002-12-03 19:01:53.681065709+00:00  125.366213  19.287110  101.074222  ...  41.179352  68.587282  279.124147
2002-12-03 19:01:53.682730669+00:00  163.940432   0.000000    0.000000  ...  39.343719  67.242433  281.738893
...                                         ...        ...         ...  ...        ...        ...         ...
2002-12-03 19:02:03.669161407+00:00  -19.287110 -38.574219 -176.879889  ...  44.299926  62.857057  265.396730
2002-12-03 19:02:03.670826367+00:00  -19.287110  -9.643555 -113.708500  ...  46.013183  64.552736  267.357790
2002-12-03 19:02:03.672491327+00:00  -28.930664   9.643555   25.268556  ...  50.418701  68.061036  273.240968
2002-12-03 19:02:03.674156288+00:00  -28.930664   9.643555   37.902833  ...  52.621460  69.405885  275.202028
2002-12-03 19:02:03.675821248+00:00  -77.148438  -9.643555  138.977056  ...  52.437896  69.522829  271.279909

[6007 rows x 376 columns]
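
For completeness, a quick sketch (assuming the same cropped raw as above) confirming the index type and that no fixed frequency is attached, which is what the following comments pick up on:

df = raw.to_data_frame(time_format='datetime', index='time')
print(df.index.dtype)  # datetime64[ns, UTC] -- tz-aware, matching the +00:00 offsets above
print(df.index.freq)   # None: no fixed frequency is attached (see below)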

hoechenberger commented 2 years ago

🎉

agramfort commented 2 years ago

hum, indeed the way it's done now leads to:

df.index.freq == None

what I suggested above keeps the sampling frequency, as it gives:

In [33]: df.index.freq
Out[33]: <1664960 * Nanos>

what do you think @drammock ?
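
For context, a sketch of one concrete place where the missing .freq shows up (hedged; it relies on pandas raising NullFrequencyError when shifting an index that has no frequency):

# timedelta-based index (freq is None): shifting by a number of periods fails
df.index.shift(1)   # raises NullFrequencyError: Cannot shift with no freq
# date_range-based index (freq set): the same call shifts every timestamp
# forward by one sample period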

drammock commented 2 years ago

yeah, currently there is no freq because it's implemented by converting times to a timedelta, then adding that to the meas_date:

https://github.com/mne-tools/mne-python/blob/7ecd46fe444a9c7e9d2b0c7f63ef05af09d41082/mne/utils/dataframe.py#L44-L46

if you think having .freq is important, I don't object to changing the implementation.

drammock commented 2 years ago

@agramfort I took a look at (something similar to) your implementation. The main problem is that your way of doing it necessarily risks rounding error when converting 1 / sfreq to an integer number of nanoseconds (not a problem if sfreq is a nice integer like 1000, but for the sample dataset you see the issue).

For your snippet of 10 s of data, the last sample time is off by 1488 nanoseconds:

import numpy as np
from pandas import date_range, to_timedelta

_, times = raw[:]
# current implementation: convert each sample time to a timedelta, add meas_date
main = to_timedelta(times + raw.first_time, unit='s') + raw.info['meas_date']
# proposed implementation: fixed-frequency date_range with 1/sfreq rounded to ns
alternative = date_range(
    start=raw.info['meas_date'] + to_timedelta(raw.first_time, unit='s'),
    periods=len(times),
    freq=f'{np.rint(1e9 / raw.info["sfreq"]).astype(int)}N')
diff = main[-1] - alternative[-1]  # accumulated rounding error at the last sample
diff.isoformat()
# 'P0DT0H0M0.000001488S'

This means that for a 60-minute recording the last sample is off by 0.53568 milliseconds (more than half a millisecond). To me that seems like too much.
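
A quick back-of-the-envelope extrapolation of that drift, assuming the rounding error accumulates linearly with recording length (this is where the 0.53568 ms figure comes from):

drift_ns_per_10s = 1488  # measured on the 10 s snippet above
for minutes in (1, 10, 60):
    drift_ns = drift_ns_per_10s * (minutes * 60 / 10)
    print(f"{minutes:>3} min recording: ~{drift_ns / 1e6:.3f} ms off at the last sample")
# 60 min -> ~0.536 ms, i.e. roughly a third of a sample period at ~600 Hz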

agramfort commented 2 years ago

hum... I need to think... but I get your point.
