[ENH] Methodology for Inputting Desired Database Structure

JRandy77 commented 1 year ago

Is there an existing issue for this?

[X] I have searched the existing issues

New feature

Taking inspiration from file-tree, I think we should adopt their methodology for templating the data structure. The current methodology requires less input from the user, but I think it's more confusing and more prone to mistakes that would be hard to notice.

Example of Current Specification

[Categorization]
regular expression for file identifier = _.*$
regular expression for directory identifier = ^.*-

The program then handles the distinct file names itself: it basically creates a new category for any file/directory name that is unique after trimming with the specified regular expression. This makes it very adaptable to whatever database structure you input, but requires more interpretation by the user to find errors.
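To make the "hard to notice" part concrete, here is a minimal sketch of regex-based categorization. It is not the actual file_tree_check code: the helper names (`category`, `group_by_category`) are made up, and I am assuming the matched part of the name is what gets kept as the category.

```python
import re
from collections import defaultdict

# Patterns copied from the [Categorization] example above.
FILE_IDENTIFIER = re.compile(r"_.*$")
DIR_IDENTIFIER = re.compile(r"^.*-")


def category(name, pattern):
    """Return the part of `name` matched by `pattern` (assumed to be the
    category key); fall back to the full name when nothing matches."""
    match = pattern.search(name)
    return match.group(0) if match else name


def group_by_category(names, pattern):
    """Group names by their category key (hypothetical helper)."""
    groups = defaultdict(list)
    for name in names:
        groups[category(name, pattern)].append(name)
    return dict(groups)


# "Tw1" is a deliberate typo: it silently becomes its own category.
files = ["sub-01_T1w.nii.gz", "sub-02_T1w.nii.gz", "sub-02_Tw1.nii.gz"]
file_groups = group_by_category(files, FILE_IDENTIFIER)

dirs = ["sub-01", "sub-02", "anat"]
dir_groups = group_by_category(dirs, DIR_IDENTIFIER)
```

Nothing flags the misspelled file as wrong here. It just shows up as one more category, and the user has to spot it in the output.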

Example of file-tree templating

paper_name = my-first-project, a-better-project, best-work-ever
version = first, final, real-final, after-feedback, final-final, last

papers
    {paper_name}
        version-{version}
            manuscript.md
            references.bib
            {paper_name}.pdf (output)

This method requires the user to explicitly write out the file structure, using indentation to indicate parent/child relationships and bracketed keywords to indicate areas of variability, e.g. subject number. I think using a template file like this will help the program give more specific feedback. It does require a little more effort from the user, but example templates can be provided, and comments can be included in the template file by starting a line with "#". I will attach a BIDS example at the bottom.
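As a rough illustration of how the indentation encodes parent/child relationships, here is a small sketch that turns the tree part of a template into full path templates. This is not file-tree's own parser: `parse_template` is a made-up helper, it ignores the variable definition lines, and it treats a trailing "(name)" as a short label to strip, which is my reading of the format.

```python
import re

TEMPLATE = """\
# comment lines start with '#'
papers
    {paper_name}
        version-{version}
            manuscript.md
            references.bib
            {paper_name}.pdf (output)
"""

# Trailing "(label)" after a name, e.g. "(output)".
LABEL = re.compile(r"\s*\((\w+)\)\s*$")


def parse_template(text):
    """Return one path template per non-comment line of the tree."""
    paths = []
    stack = []  # (indent, name) for each ancestor of the current line
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        indent = len(line) - len(line.lstrip())
        name = LABEL.sub("", line.strip())
        # Drop entries that are not parents of this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        stack.append((indent, name))
        paths.append("/".join(n for _, n in stack))
    return paths


paths = parse_template(TEMPLATE)
```

Each entry in `paths` is a full template such as `papers/{paper_name}/version-{version}/manuscript.md`, so the program would know exactly where every file is expected to be.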

Possible extension using file-tree

It is possible to create these template files using file-tree itself, with some more advanced functionality, which could limit how much of a template the user has to write by hand. However, some work would have to be done to improve usability.

BIDS Raw example

ext=.nii.gz

dataset_description.json
participants.tsv
README (readme)
CHANGES (changes)
LICENSE (license)
genetic_info.json
sub-{participant}
    [ses-{session}]
        sub-{participant}_sessions.tsv (sessions_tsv)
        anat (anat_dir)
            sub-{participant}[_ses-{session}][_acq-{acq}][_ce-{ce}][_rec-{rec}][_run-{run_index}]_{modality}{ext} (anat_image)
            sub-{participant}[_ses-{session}][_acq-{acq}][_ce-{ce}][_rec-{rec}][_run-{run_index}][_mod-{modality}]_defacemask{ext} (anat_deface)
        func (func_dir)
            sub-{participant}[_ses-{session}]_task-{task}[_acq-{acq}][_ce-{ce}][_dir-{dir}][_rec-{rec}][_run-{run_index}][_echo-{echo}]_bold{ext}  (task_image)
            sub-{participant}[_ses-{session}]_task-{task}[_acq-{acq}][_ce-{ce}][_dir-{dir}][_rec-{rec}][_run-{run_index}][_echo-{echo}]_sbref{ext} (task_sbref)
            sub-{participant}[_ses-{session}]_task-{task}[_acq-{acq}][_ce-{ce}][_dir-{dir}][_rec-{rec}][_run-{run_index}][_echo-{echo}]_events.tsv  (task_events)
            sub-{participant}[_ses-{session}]_task-{task}[_acq-{acq}][_ce-{ce}][_dir-{dir}][_rec-{rec}][_run-{run_index}][_echo-{echo}][_recording-{recording}]_physio.tsv.gz (task_physio)
            sub-{participant}[_ses-{session}]_task-{task}[_acq-{acq}][_ce-{ce}][_dir-{dir}][_rec-{rec}][_run-{run_index}][_echo-{echo}][_recording-{recording}]_stim.tsv.gz (task_stim)
        dwi (dwi_dir)
            sub-{participant}[_ses-{session}][_acq-{acq}][_run-{run_index}]_dwi{ext} (dwi_image)
            sub-{participant}[_ses-{session}][_acq-{acq}][_run-{run_index}]_dwi.bval (bval)
            sub-{participant}[_ses-{session}][_acq-{acq}][_run-{run_index}]_dwi.bvec (bvec)
        fmap (fmap_dir)
            sub-{participant}[_ses-{session}][_acq-{acq}][_run-{run_index}]_phasediff{ext} (fmap_phasediff)
            sub-{participant}[_ses-{session}][_acq-{acq}][_run-{run_index}]_magnitude{ext} (fmap_mag)
            sub-{participant}[_ses-{session}][_acq-{acq}][_run-{run_index}]_magnitude1{ext} (fmap_mag1)
            sub-{participant}[_ses-{session}][_acq-{acq}][_run-{run_index}]_magnitude2{ext} (fmap_mag2)
            sub-{participant}[_ses-{session}][_acq-{acq}][_run-{run_index}]_phase1{ext} (fmap_phase1)
            sub-{participant}[_ses-{session}][_acq-{acq}][_run-{run_index}]_phase2{ext} (fmap_phase2)
            sub-{participant}[_ses-{session}][_acq-{acq}][_run-{run_index}]_fieldmap{ext} (fmap)
            sub-{participant}[_ses-{session}][_acq-{acq}]_dir-{dir}[_run-{run_index}]_epi{ext} (fmap_epi)
        meg (meg_dir)
            sub-{participant}[_ses-{session}]_task-{task}[_run-{run}][_proc-{proc}]_meg.{meg_ext} (meg)
        eeg (eeg_dir)
            sub-{participant}[_ses-{session}]_task-{task}[_run-{run}][_proc-{proc}]_eeg.{eeg_ext} (eeg)
        ieeg (ieeg_dir)
            sub-{participant}[_ses-{session}]_task-{task}[_run-{run}][_proc-{proc}]_ieeg.{ieeg_ext} (ieeg)
        beh (behavioral_dir)
            sub-{participant}[_ses-{session}]_task-{task}_events.tsv (behavioural_events)
            sub-{participant}[_ses-{session}]_task-{task}_beh.tsv (behavioural)
            sub-{participant}[_ses-{session}]_task-{task}_physio.tsv.gz (behavioural_physio)
            sub-{participant}[_ses-{session}]_task-{task}_stim.tsv.gz (behavioral_stim)

This does look like a lot and would take some time to write, but providing example templates would reduce the amount of work needed.

Unclear documentation

No response

JRandy77 commented 1 year ago

It also allows some flexibility in what is considered an error. For example, one subject might have done only one session and so have no "session" folders, while another might have them. This can be handled gracefully because variables written within [] square brackets are optional.
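A quick sketch of how the optional brackets could be checked, applied to a filename rather than a folder. It assumes placeholders only take alphanumeric values, and `template_to_regex` is a made-up helper, not part of file-tree.

```python
import re

# A placeholder like {session}, or an opening/closing optional bracket.
TOKEN = re.compile(r"\{(\w+)\}|\[|\]")


def template_to_regex(template):
    """Translate one template line into a compiled regular expression."""
    parts = []
    seen = set()
    pos = 0
    for tok in TOKEN.finditer(template):
        # Literal text between tokens is matched exactly.
        parts.append(re.escape(template[pos:tok.start()]))
        if tok.group(0) == "[":
            parts.append("(?:")  # start of an optional chunk
        elif tok.group(0) == "]":
            parts.append(")?")   # end of an optional chunk
        else:
            name = tok.group(1)
            if name in seen:
                # A repeated placeholder must take the same value again.
                parts.append(f"(?P={name})")
            else:
                seen.add(name)
                parts.append(f"(?P<{name}>[A-Za-z0-9]+)")
        pos = tok.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts))


pattern = template_to_regex("sub-{participant}[_ses-{session}]_T1w.nii.gz")

with_session = pattern.fullmatch("sub-01_ses-02_T1w.nii.gz")
without_session = pattern.fullmatch("sub-01_T1w.nii.gz")
misspelled = pattern.fullmatch("sub-01_Tw1.nii.gz")
```

Both the name with a session and the name without one match, while the misspelled suffix does not, so only the real mistake gets reported.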

Remi-Gau commented 1 year ago

Also I am pretty sure that the BIDS raw template can be generated from the BIDS schema, so we do not have to write it by hand.

The BIDS-related ones that would be nice to have are those for the output of fmriprep.

Here is an example of what this looks like:

https://github.com/nipreps/fmriprep/blob/master/.circleci/ds005_bids_outputs.txt

JRandy77 commented 1 year ago

Is that an exhaustive list of outputs from fmriprep, or is it possible there are more?

Remi-Gau commented 1 year ago

I am pretty sure there can be "more" but they should at least follow a general pattern a bit like this:

sub-{sub}
    [ses-{ses}]
        func
            sub-{sub}[_ses-{ses}]_task-{task}[_acq-{acq}][_ce-{ce}][_dir-{dir}][_rec-{rec}][_run-{run_index}][_space-{space}][_res-{res}][_den-{label}][_desc-{desc}]_bold.nii.gz

meaning that you could have filenames with "chunks" that do not appear in the linked example output (like acq or rec) but that are still valid

Does that make sense? Does that help?

JRandy77 commented 1 year ago

Yes that does, that looks pretty much the same as BIDS. I'm working on creating a file-tree template file for fmriprep. I'm gonna do some more digging on my own, but I may have some more questions for you later. Do you know where I can find some more example outputs of fmriprep? Perhaps on the BIC server?

Remi-Gau commented 1 year ago

look in this GitHub org: https://github.com/OpenNeuroDerivatives

every repo is a "pointer" to a dataset, and many of them are fmriprep outputs

Remi-Gau commented 1 year ago

but you may need to install datalad on your machine if you want to be able to clone them without filling up your hard drive

https://handbook.datalad.org/en/latest/intro/installation.html

neurodatascience / file_tree_check