nipreps / fmriprep

fMRIPrep is a robust and easy-to-use pipeline for preprocessing of diverse fMRI data. The transparent workflow dispenses of manual intervention, thereby ensuring the reproducibility of the results.
https://fmriprep.org
Apache License 2.0
629 stars 292 forks source link

Store master seed of random number generators in the derivative's ``dataset_description.json`` file #2739

Open oesteban opened 2 years ago

oesteban commented 2 years ago

What would you like to see added in fMRIPrep?

This parameter is a bit inaccessible ATM, buried inside the config file. It seems relevant enough to extract into the description JSON.

I would also consider sending a PR to the BIDS specs recommending storing seeds in the derivatives description file - would that make sense?

Do you have any interest in helping implement the feature?

Yes

Additional information / screenshots

No response

effigies commented 2 years ago

I would say that the full list of arguments (excepting filters like --participant-label and --bids-filter-file, as these will be modified uselessly by per-subject runs) make sense. And extracting the random seed seems reasonable as well.

mgxd commented 2 years ago

I like the idea of propagating the arguments used - is there a standardized way to convey this within the dataset_description.json?

If not, I think the most logical place would be a new field (PipelineArgs?) within GeneratedBy.

effigies commented 2 years ago

Sounds good. Would probably just call it Args, since not everything that creates a derivative would need to be a pipeline.

oesteban commented 2 years ago

Upon some reflection, I apologize for proposing a probably bad approach and discovered some quasi-bug in our handling of seeds.

Why the approach is not great. The way fMRIPrep is designed, the master seed is a workflow-wise parameter. However, the user can choose to run one or many subjects. So, disguised under workflow-wise, the parameter (as all the command line arguments) is subject-wise in reality.

We are already having some side effects of this precisely with the dataset_description.json and had to address some problems in the past where races would happen between jobs on a cluster to write the same content. Similar issues occur with the citation boilerplate - the last being run wins.

The quasi-bug. For the same reason, when the user runs 4 subjects within one job, the four of them will use the same random seed (unless set by the user manually, which is beside the point). However, the same 4 subjects will use four different seeds if run in separate jobs. This is not a bug per-se, but it would be elegant (and keep our entropy levels unrelated to how you run the tool) to have four seeds either way.

Option 1: a subject level description.json file.

Basically almost the same as the dataset_description.json but at the root of the subject sub-01/.

However, I believe a better option would be:

Option 2: a dataset-level TSV file.

Instead, we could have a .tsv file at the root of the fMRIPrep outputs with each subject's details:

participant_id master_seed date duration command_line
sub-01 230939834 Mar 23, 2022 14:45:34.14 fmriprep ... --anat-only
sub-02 389435433 Mar 23, 2022 12:40:45.15 fmriprep ...
sub-01 454985493 Mar 23, 2022 4:42:10.54 fmriprep ...

As you can see, we could also use it as some sort of provenance tracking resource, so that you can understand here, e.g., if you are first running all subjects with --anat-only or not.

The columns I propose are just an example and probably more thinking should be put into it.

Current BIDS / BIDS-Derivatives

I'm not aware of any BIDS metadata file that would allow us to do either option. But if the tsv doesn't exist right now (e.g., making participant_id to be nonunique), it doesn't look like a crazy feature to request, at least for BIDS Derivatives

Side effects

Other issues such as the boilerplate could be addressed in similar ways - have the CITATION file per subject (maybe as a README) and then implement some sort of fmriprep ... group CLI that reads through them and calculates the m.c.d. That's the approach of MRIQC for the group reports, and I don't particularly love it, but it would probably be better than what we currently do.

oesteban commented 2 years ago

BTW, being able to retrieve the seed of a prior --anat-only run would be beneficial when running subjects in the split way, so that the same seed is used by the full pipeline, which should be equivalent to running the full version originally.

oesteban commented 2 years ago

Related to the above issues: #1311

oesteban commented 2 years ago

@effigies @mgxd any feedback on this?

mgxd commented 2 years ago

Yes, I think this is a worthwhile addition. I'm envisioning it as either fmriprep ... group, or as a standalone script (added as an entrypoint). For simplicity, it should be run after all participant level runs have finished (that way, no need to worry about concurrent read/writes).

For the logic itself, we can parse <outdir> / sub-* / log / <uuid> / fmriprep.toml and set up a table like you mocked, adding uuid as a column as well so we can a column of unique keys. Additionally, we can check if there are any crashfiles present, and add an errors_present column or something similar to easily convey failed runs. Other potential options we may want to extract as columns: anat_only, reuse_anatomical_derivatives,