can pipeline be more robust to handling missing data?

drammock commented 2 months ago

another "I have a longitudinal study" question here. I have some subjects with missing timepoints (AKA missing sessions). I want the pipeline to look at the config file (which says subjects = "all" and sessions = ["a", "b", "c"]) and then look at the bids data tree and process all the subjectXsession combinations that are present, without crashing because subject 101 is missing session "b". In practice, such errors take the form of:

┌────────┬ init/_02_find_empty_room ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
│13:31:56│ 🐛 sub-101 ses-b A critical error occurred. The error message was: Missing file: /redacted/redacted/bids-data/sub-101/ses-b/meg/sub-101_ses-b_task-redacted_split-01

which is confusing; it errors when looking for a split partial file, when the real problem is that there isn't even a .../sub-101/ses-b/ directory.

Proposed solutions

just make the error message better, and do something complicated in the config where I figure out which subject-session combos I have, and run the pipeline multiple times (e.g. once for each session, with different lists of subjects each time)
blithely add an allow_missing=True here, and hope that it doesn't have catastrophic downstream effects:

https://github.com/mne-tools/mne-bids-pipeline/blob/61f29616f5e3c2e246365f9568ae2c168616617c/mne_bids_pipeline/steps/init/_02_find_empty_room.py#L45

it seems like there are a lot of places where we do for subject in subjects and for session in sessions where subjects and sessions are taken straight from the config file each time. So it won't work to do a simple Path(...).exists() type of check to see if the entire session folder is missing, right about here:

https://github.com/mne-tools/mne-bids-pipeline/blob/61f29616f5e3c2e246365f9568ae2c168616617c/mne_bids_pipeline/steps/init/_01_init_derivatives_dir.py#L79-L81

But we could do that check, but then store in a private config location the subjectXsession combos that are actually present, and use that info to get subjects/sessions to loop over, throughout the codebase.

Open to other ideas!

agramfort commented 2 months ago

I've also run into this issue... I remember that BIDS was considering this as problematic datasets but it's such a prevalent usecase...

Message ID: @.***>

hoechenberger commented 2 months ago

@sappelhoff Do you know if and how BIDS supports data with missing sessions?

hoechenberger commented 2 months ago

Reading the specs: https://bids-specification.readthedocs.io/en/stable/longitudinal-and-multi-site-studies.html https://bids-specification.readthedocs.io/en/stable/modality-agnostic-files.html#sessions-file

I don't see anything that would disallow missing sessions. But this use case isn't explicitly mentioned anywhere, either…

sappelhoff commented 2 months ago

To the best of my knowledge, it is not disallowed. I also remember preparing a mixed design experiment once, where half of participants had a session that the other half did not have, and that led to a warning (to check whether this was intended), but not an error.

hoechenberger commented 2 months ago

5. But we could do that check, but then store in a private config location the subjectXsession combos that are actually present, and use that info to get subjects/sessions to loop over, throughout the codebase.

Yep that would be my proposal. However we also have runs within sessions and then stuff gets really complicated, at least I remember seeing some rather convoluted code handling runs within subjects and sessions … maybe we should re-think all of this and create a clean, clear lookup-table early on in the pipeline run and then just refer to this downstream.

agramfort commented 2 months ago

+1

it's great if it's just a warning. I have definitely seen this warning in the past.

Message ID: @.***>

drammock commented 2 months ago

To the best of my knowledge, it is not disallowed. I also remember preparing a mixed design experiment once, where half of participants had a session that the other half did not have, and that led to a warning (to check whether this was intended), but not an error.

was that warning from the BIDS validator, or from MNE-BIDS-Pipeline though? If the standard allows it, IMO the pipeline shouldn't error out.

sappelhoff commented 2 months ago

It was a warning from the BIDS validator.

If the standard allows it, IMO the pipeline shouldn't error out.

agreed

larsoner commented 2 months ago

maybe we should re-think all of this and create a clean, clear lookup-table early on in the pipeline run and then just refer to this downstream.

Sure this would work. Or (equivalently) something like get_subjects_sessions that returns list[tuple[str, str]] of the subject-session pairs on the fly just like our get_runs_tasks that returns a list[tuple[str, str]] of the run-task pairs. This would be a bit of an intermediate solution but I think it would work?

And I think it would be good to have a config.allow_missing_sessions = False (by default) that you would have to set to True to allow it to just skip missing sessions when not present. That way people won't get any unpleasant surprises thinking they're analyzing a complete dataset when really they're missing some stuff.

mne-tools / mne-bids-pipeline

can pipeline be more robust to handling missing data? #992

Proposed solutions