How to deal with anonymization if one of the dates slips out of range?

hoechenberger commented 4 years ago

Working with the somato dataset, I ran into an issue related to the fact that the measurement date I'm intending to anonymize deviates from the date in info['file_id']['secs'] such that, when I reach the desired adjustment for the measurement date, info['file_id']['secs'] goes out-of-range (even though the measurement date would be within the valid range).

MWE:

import mne
import os.path as op
from datetime import datetime

data_path = mne.datasets.somato.data_path()
raw_fname = op.join(data_path, 'sub-01', 'meg', 'sub-01_task-somato_meg.fif')

raw = mne.io.read_raw_fif(raw_fname, verbose='error')
print(f"Date in info['meas_date']: {raw.info['meas_date']}")
print(f"Date in info['file_id']:   {datetime.fromtimestamp(raw.info['file_id']['secs'])}")

# Works
print('\nAnonymizing with daysback -> 10 years')
daysback = 10 * 365
raw_anon = raw.copy().anonymize(daysback=daysback)
print(f"Date in info['meas_date']: {raw_anon.info['meas_date']}")
print(f"Date in info['file_id']:   {datetime.fromtimestamp(raw_anon.info['file_id']['secs'])}")

# Does not work
print('\nAnonymizing with daysback -> 90 years')
daysback = 90 * 365
raw_anon = raw.copy().anonymize(daysback=daysback)
print(f"Date in info['meas_date']: {raw_anon.info['meas_date']}")
print(f"Date in info['file_id']:   {datetime.fromtimestamp(raw_anon.info['file_id']['secs'])}")

Output:

Date in info['meas_date']: 2007-07-05 11:17:11.172243+00:00
Date in info['file_id']:   1970-01-01 01:00:00

Anonymizing with daysback -> 10 years
Date in info['meas_date']: 1997-07-07 11:17:11.172243+00:00
Date in info['file_id']:   1960-01-04 01:35:47

Anonymizing with daysback -> 90 years
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Untitled-2 in 
     21 print('\nAnonymizing with daysback -> 90 years')
     22 daysback = 90 * 365
---> 23 raw_anon = raw.copy().anonymize(daysback=daysback)
     24 print(f"Date in info['meas_date']: {raw_anon.info['meas_date']}")
     25 print(f"Date in info['file_id']:   {datetime.fromtimestamp(raw_anon.info['file_id']['secs'])}")

 in anonymize(self, daysback, keep_his, verbose)

~/Development/mne-python/mne/channels/channels.py in anonymize(self, daysback, keep_his, verbose)
    593         .. versionadded:: 0.13.0
    594         """
--> 595         anonymize_info(self.info, daysback=daysback, keep_his=keep_his,
    596                        verbose=verbose)
    597         self.set_meas_date(self.info['meas_date'])  # unify annot update

 in anonymize_info(info, daysback, keep_his, verbose)

~/Development/mne-python/mne/io/meas_info.py in anonymize_info(info, daysback, keep_his, verbose)
   2245                 'daysback parameter was too large.'
   2246                 'Underlying Error:\n')
-> 2247     _check_dates(info, prepend_error=err_mesg)
   2248 
   2249     return info

~/Development/mne-python/mne/io/meas_info.py in _check_dates(info, prepend_error)
   1454                 if (value[key_2] < np.iinfo('>i4').min or
   1455                         value[key_2] > np.iinfo('>i4').max):
-> 1456                     raise RuntimeError('%sinfo[%s][%s] must be between '
   1457                                        '"%r" and "%r", got "%r"'
   1458                                        % (prepend_error, key, key_2,

RuntimeError: anonymize_info generated an inconsistent info object. daysback parameter was too large.Underlying Error:
info[file_id][secs] must be between "-2147483648" and "2147483647", got "-2838237853"

How shall I go about anonymizing this dataset? I want to use daysback large enough to move the measurement date before 1925, as is required by BIDS.

cc @agramfort

cbrnr commented 4 years ago

Why is info['file_id'] equal to 1970-01-01 01:00:00 and why does this have to be anonymized? It seems like this date has already been set to a date that definitely does not correspond to the file creation date (assuming that the file_id key corresponds to file creation).

agramfort commented 4 years ago

I agree is should not happen that info['file_id'] is not coherent with meas_date

for me it's a bug due to old mne version that was not setting the date correctly everywhere

hoechenberger commented 4 years ago

Ok, so I guess what we should do in this case is

Emit a warning
Anonymize measurement date
Set file date equal to anonymized measurement date

Right? I can make a PR if we all agree.

agramfort commented 4 years ago

you should fix the file with set_meas_date before anonymization

hoechenberger commented 4 years ago

But there are files like this out there in the wild.... should we not take care of them automatically in case of (re)anonymozation? Or is that what you meant: handle this inside Raw.anonymize()?

agramfort commented 4 years ago

ok you can try to warn if the dates are inconsistent and only use meas_date as true value

cbrnr commented 4 years ago

The file_id key seems to be specific to FIF, so I agree that if the sec subfield does not match meas_date, warn, anonymize meas_date, and set file_id equal to the anonymized meas_date.

While we're at it, why does anonymize default to setting the date to Jan 01 2000? Wouldn't it be more convenient to follow the BIDS recommendation to set the date to e.g. Jan 01 1920 instead?

agramfort commented 4 years ago

The file_id key seems to be specific to FIF, so I agree that if the sec subfield does not match meas_date, warn, anonymize meas_date, and set file_id equal to the anonymized meas_date.

yes I don't see how they can become different

While we're at it, why does anonymize default to setting the date to Jan 01 2000? Wouldn't it be more convenient to follow the BIDS recommendation to set the date to e.g. Jan 01 1920 instead?

I think it's better to enforce users to shift the dates to keep consistent ordering of subjects by dates

hoechenberger commented 4 years ago

I agree with @agramfort , there probably should not be a default date

cbrnr commented 4 years ago

I think it's better to enforce users to shift the dates to keep consistent ordering of subjects by dates

Yes, but there is a default, which happens to be a bad one because it's not BIDS-compliant. Are you suggesting we should remove the default (in another PR)?

hoechenberger commented 4 years ago

@cbrnr Yes I suggest we should consider removing the default

cbrnr commented 4 years ago

OK. Some arguments for/against removing the default:

Pros:

Users are forced to select the number of days to shift (explicit is better than implicit).
The relative time deltas between recordings of different subjects is maintained.

Cons:

If users specify the same shift for all subjects in a study, it might be easier to reverse the anonymization (because you only have to guess one number). If we kept a default fixed date, the shift is different for subjects (at least if the recordings took place on different days).
The relative time deltas between recordings might not be relevant at all - I can't think of a reason where this might be important (as long as subject age can be inferred, which is still the case), so why not provide a sensible default then.
Is this API already in a release? If so, we'd have to go through the usual deprecation cycle.

I don't really see the point of removing the default, I think it's more convenient to let MNE handle the anonymization - but I'd set a different default to <1925.

agramfort commented 4 years ago

mne-bids will complain. mne-python is not tied to bids. You will write bids files with mne-bids so I don't see the urge to change anything here. Now we need to make sure mne-bids warns properly.

drammock commented 4 years ago

The relative time deltas between recordings might not be relevant at all - I can't think of a reason where this might be important (as long as subject age can be inferred, which is still the case)

An example would be a training study where the number of days between different recordings of the same subject was an important covariate.

hoechenberger commented 4 years ago

Yup. I always report inter-session intervals for multi-session studies. Would be nice to have this information contained in the info.

-- Sent from my phone, please excuse brevity and erroneous auto-correct.

On 18. Aug 2020, at 17:14, Daniel McCloy notifications@github.com wrote:

The relative time deltas between recordings might not be relevant at all - I can't think of a reason where this might be important (as long as subject age can be inferred, which is still the case) An example would be a training study where the number of days between different recordings of the same subject was an important covariate.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

cbrnr commented 4 years ago

True, I was only thinking of between subjects. I think the argument should support a date in addition to days.

hoechenberger commented 4 years ago

True, I was only thinking of between subjects. I think the argument should support a date in addition to days.

I was thinking the same, but then – again! – it wouldn't be easy to preserve relative date differences!

Considering this, having a default date as we have now is not very useful in this regard either. However, changing it to a fixed number of days could lead to an easy de-anonymization if users forget about manually passing daysback.

Hence I conclude: we actually cannot have any reasonable default.

But really this is beyond the scope of this particular issue :)

Here I will only focus on this:

@agramfort

ok you can try to warn if the dates are inconsistent and only use meas_date as true value

will make a PR shortly

hoechenberger commented 4 years ago

Looking at this again with @agramfort, there seems to be another problem involved: Raw.set_meas_date() doesn't touch the dates in info['file_id'] -- maybe that's on purpose? MWE:

# %%
import mne
import os.path as op
from datetime import datetime

data_path = mne.datasets.somato.data_path()
raw_fname = op.join(data_path, 'sub-01', 'meg', 'sub-01_task-somato_meg.fif')

raw = mne.io.read_raw_fif(raw_fname, verbose='error')
print(f"Date in info['meas_date']: {raw.info['meas_date']}")
print(f"Date in info['file_id']:   {datetime.utcfromtimestamp(raw.info['file_id']['secs'])}")

print('\nAfter calling set_meas_date():')
raw.set_meas_date(raw.info['meas_date'])
print(f"Date in info['meas_date']: {raw.info['meas_date']}")
print(f"Date in info['file_id']:   {datetime.utcfromtimestamp(raw.info['file_id']['secs'])}")

raw.anonymize(daysback=30000)

Output:

Date in info['meas_date']: 2007-07-05 11:17:11.172243+00:00
Date in info['file_id']:   1970-01-01 00:00:00

After calling set_meas_date():
Date in info['meas_date']: 2007-07-05 11:17:11.172243+00:00
Date in info['file_id']:   1970-01-01 00:00:00
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Untitled-2 in 
     17 print(f"Date in info['file_id']:   {datetime.utcfromtimestamp(raw.info['file_id']['secs'])}")
     18 
---> 19 raw.anonymize(daysback=30000)

 in anonymize(self, daysback, keep_his, verbose)

~/Development/mne-python/mne/channels/channels.py in anonymize(self, daysback, keep_his, verbose)
    593         .. versionadded:: 0.13.0
    594         """
--> 595         anonymize_info(self.info, daysback=daysback, keep_his=keep_his,
    596                        verbose=verbose)
    597         self.set_meas_date(self.info['meas_date'])  # unify annot update

 in anonymize_info(info, daysback, keep_his, verbose)

~/Development/mne-python/mne/io/meas_info.py in anonymize_info(info, daysback, keep_his, verbose)
   2245                 'daysback parameter was too large.'
   2246                 'Underlying Error:\n')
-> 2247     _check_dates(info, prepend_error=err_mesg)
   2248 
   2249     return info

~/Development/mne-python/mne/io/meas_info.py in _check_dates(info, prepend_error)
   1454                 if (value[key_2] < np.iinfo('>i4').min or
   1455                         value[key_2] > np.iinfo('>i4').max):
-> 1456                     raise RuntimeError('%sinfo[%s][%s] must be between '
   1457                                        '"%r" and "%r", got "%r"'
   1458                                        % (prepend_error, key, key_2,

RuntimeError: anonymize_info generated an inconsistent info object. daysback parameter was too large.Underlying Error:
info[file_id][secs] must be between "-2147483648" and "2147483647", got "-2591997853"

cbrnr commented 4 years ago

I have no clue what info['file_id'] contains and the docs don't really explain it either.

agramfort commented 4 years ago

here is a fix:

import mne
import os.path as op
from datetime import datetime, timezone

data_path = mne.datasets.somato.data_path()
raw_fname = op.join(data_path, 'sub-01', 'meg', 'sub-01_task-somato_meg.fif')

raw = mne.io.read_raw_fif(raw_fname, verbose='error')

print(f"Date in info['meas_date']: {raw.info['meas_date']}")
print(f"Date in info['file_id']:   {datetime.utcfromtimestamp(raw.info['file_id']['secs'])}")

print('\nAfter calling set_meas_date():')
meas_date = datetime.utcfromtimestamp(
    raw.info['file_id']['secs'] + 1e-6 * raw.info['file_id']['usecs'])
meas_date = meas_date.replace(tzinfo=timezone.utc)
raw.set_meas_date(meas_date)
print(f"Date in info['meas_date']: {raw.info['meas_date']}")
print(f"Date in info['file_id']:   {datetime.utcfromtimestamp(raw.info['file_id']['secs'])}")

raw.anonymize(daysback=20000)
print(f"Date in info['meas_date']: {raw.info['meas_date']}")
print(f"Date in info['file_id']:   {datetime.utcfromtimestamp(raw.info['file_id']['secs'])}")

don't know what happened to this file. But what bothers now with current code is that when you anonymize the default for file_id is 1970 and for meas_date it's 2000 so you quickly end up with data where file_id and meas_date are not consistent. For me this should not happened. This deserves a first fix and then I think we should make sure that when we set_meas_date the file_id should then be adjusted.

thoughts? @alexrockhill @larsoner ?

larsoner commented 4 years ago

It does seem incorrect that these are not consistent. Worth trying to fix it to see if it breaks some existing test -- this might tell us if we had some reason to do it this way

agramfort commented 4 years ago

looking at it now

hoechenberger commented 4 years ago

Not sure if this is exactly the same problem, but it seems related: in the following MWE, info['file_id']['secs'] doesn't survive an I/O roundtrip.

import mne
from mne.datasets import sample

data_path = sample.data_path()
subjects_dir = data_path + '/subjects'
fname_in = data_path + '/MEG/sample/sample_audvis_raw.fif'
fname_out = '/tmp/raw.fif'

# Create a copy.
raw = mne.io.read_raw_fif(fname_in, verbose=False)
raw.save(fname_out, overwrite=True)

# Read copy.
raw = mne.io.read_raw_fif(fname_out, preload=True, verbose=False)
print(f"Before anonymization: {raw.info['file_id']['secs']}")

# Anonymize.
raw.anonymize(daysback=-20)
print(f"After anonymization: {raw.info['file_id']['secs']}")

# Test I/O roundtrip.
raw.save(fname_out, overwrite=True)
raw = mne.io.read_raw_fif(fname_out, verbose=False)
print(f"After anonymization & I/O roundtrip: "
      f"{raw.info['file_id']['secs']}")

Output:

Before anonymization: 0
After anonymization: 1730147
After anonymization & I/O roundtrip: 0

alexrockhill commented 4 years ago

I think @bloyl worked on this part if I'm not mistaken. I can take a look but I was only using the functionality as an end-user with mne-bids

hoechenberger commented 4 years ago

I ran into this when trying to ensure that all date-related information in the FIFF files is properly anonymized when converting to BIDS and before sharing the datasets. So I'm in a similar situation: only ever got in touch with this stuff because I was using MNE-BIDS. :)

agramfort commented 4 years ago

I think meas_id is meant to be saved but the file_id is supposed to be unique to a file. That's what I remember from conversations. Maybe @larsoner knows more.

hoechenberger commented 4 years ago

@agramfort

I think meas_id is meant to be saved but the file_id is supposed to be unique to a file

@larsoner earlier said,

It does seem incorrect that these are not consistent.

So I believe they should always stay in sync

hoechenberger commented 4 years ago

So I believe they should always stay in sync

I mean I wouldn't care, really, if I weren't concerned we might be accidentally leaking (non-anonymized) measurement info via file_id. Is this concern justified?

larsoner commented 4 years ago

Thinking about it again, it's likely that file_id is supposed to be unique for the given file / generated at write time. Maybe we should just comment about this in the code?

agramfort commented 4 years ago

Thinking about it again, it's likely that file_id is supposed to be unique for the given file / generated at write time. Maybe we should just comment about this in the code?

sounds like a good idea to me

bloyl commented 4 years ago

git grep -n FIFF_FILE_ID suggests that file id is set here during write: https://github.com/mne-tools/mne-python/blob/92254198f02e19ca9227888fd2b1d5f8a90c8084/mne/io/write.py#L292-L304

however we don't seem to call start_file with the _id argument. So the default (None) is use. which leads to DATE_NONE being used for mne-python file_ids.

see: https://github.com/mne-tools/mne-python/blob/92254198f02e19ca9227888fd2b1d5f8a90c8084/mne/io/write.py#L265-L267 https://github.com/mne-tools/mne-python/blob/92254198f02e19ca9227888fd2b1d5f8a90c8084/mne/io/write.py#L458-L464

Fundamentally, the problem is the lack of a published FIFF standard (at least that I know of). The docstring below seems to be the best we have: https://github.com/mne-tools/mne-python/blob/92254198f02e19ca9227888fd2b1d5f8a90c8084/mne/io/meas_info.py#L195

agramfort commented 4 years ago

one could look at the mne-c code how it does it

mne-tools / mne-python

How to deal with anonymization if one of the dates slips out of range? #8128