Retroactively annotate a large number of BIDS datasets at once
Retroactively annotate the phenotypic data of a large number of BIDS datasets at once.

At the moment this focuses on datasets with MRI data only.

It takes as input the datasets available in the DataLad superdataset.

These copies may not reflect the latest version of every dataset on OpenNeuro and OpenNeuro derivatives.

OpenNeuro datasets:

Number of datasets: 790 with 34479 subjects including:

OpenNeuro derivatives datasets:

Number of datasets: 258 with 10582 subjects including:


Install openneuro and openneuro-derivatives using datalad

openneuro can be installed via (this will take a while):

make openneuro

openneuro derivatives can be installed via (this will take a while):

make openneuro-derivatives

Update datasets to get the latest version


Listing dataset contents

Run and it will create a TSV file with basic information for each dataset and its derivatives.
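As a rough illustration of what such a listing involves, here is a minimal sketch. It is not the repository's own script. It assumes each dataset is a directory directly under a root folder and that subjects follow the BIDS sub-<label> folder convention. The function names and the column names (name, nb_subjects, has_participants_tsv) are hypothetical.

```python
import csv
from pathlib import Path

# Hypothetical column names, chosen for this sketch only.
FIELDNAMES = ["name", "nb_subjects", "has_participants_tsv"]


def list_datasets(root: Path) -> list[dict]:
    """Collect basic info for every dataset directory directly under root."""
    rows = []
    for dataset in sorted(root.iterdir()):
        # Skip plain files and hidden folders such as .git or .datalad.
        if not dataset.is_dir() or dataset.name.startswith("."):
            continue
        # BIDS subject folders are named sub-<label>.
        subjects = [p for p in dataset.glob("sub-*") if p.is_dir()]
        rows.append(
            {
                "name": dataset.name,
                "nb_subjects": len(subjects),
                "has_participants_tsv": (dataset / "participants.tsv").exists(),
            }
        )
    return rows


def write_tsv(rows: list[dict], output: Path) -> None:
    """Write the rows to a tab-separated file with a header line."""
    with output.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
```

For example, write_tsv(list_datasets(Path("inputs")), Path("datasets.tsv")) would write one line per dataset found in a folder named inputs.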


Listing the content of the participants.tsv files

Run to get a listing of all the columns present in all the participants.tsv files and a list of all the unique columns across participants.tsv files.

Run to also get a listing of all the levels in all the columns present in all the participants.tsv files.
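The idea behind both listings can be sketched as follows. This is an illustrative sketch, not the repository's script. It assumes every dataset sits directly under a root folder with its participants.tsv at the top level, and the function name summarize_participants is hypothetical. The participant_id column is left out of the levels on purpose, since its values are unique per subject.

```python
import csv
from collections import defaultdict
from pathlib import Path


def summarize_participants(root: Path):
    """Return per-dataset columns, all unique columns, and levels per column."""
    columns_per_dataset = {}
    levels = defaultdict(set)
    for tsv_file in sorted(root.glob("*/participants.tsv")):
        with tsv_file.open(newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f, delimiter="\t")
            columns = list(reader.fieldnames or [])
            columns_per_dataset[tsv_file.parent.name] = columns
            for row in reader:
                for column in columns:
                    value = row.get(column)
                    # Skip the identifier column and cells missing from short rows.
                    if column == "participant_id" or value is None:
                        continue
                    levels[column].add(value)
    unique_columns = sorted({c for cols in columns_per_dataset.values() for c in cols})
    sorted_levels = {column: sorted(values) for column, values in levels.items()}
    return columns_per_dataset, unique_columns, sorted_levels
```

Continuous columns such as age produce one level per distinct value, so a real listing may want to treat them separately from categorical columns.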

Clone the datasets from OpenNeuro-JSONLD

The OpenNeuro-JSONLD org hosts augmented OpenNeuro datasets. To clone them efficiently, use the command below.

It uses the GH CLI and parallel, so both must be installed.

Make sure you are logged into the GH CLI before running it.

gh repo list OpenNeuroDatasets-JSONLD --fork -L 500 | awk '{print $1}' | sed 's/OpenNeuroDatasets-JSONLD\///g' | parallel -j 6 git clone https://github.com/OpenNeuroDatasets-JSONLD/{}
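For readers who prefer to drive the same pipeline from Python, a minimal sketch follows. It mirrors the steps of the shell one-liner: list the forks with gh, keep the first field of each line, strip the organization prefix, and clone six repositories at a time. The helper names are hypothetical, and the https://github.com/<org>/<repo> clone URL pattern is an assumption of this sketch.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

ORG = "OpenNeuroDatasets-JSONLD"


def parse_repo_names(gh_output: str, org: str = ORG) -> list[str]:
    """Extract bare repository names from the output of `gh repo list`."""
    names = []
    for line in gh_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The first field is "<org>/<repo>"; drop the "<org>/" prefix.
        names.append(fields[0].removeprefix(f"{org}/"))
    return names


def clone_url(name: str, org: str = ORG) -> str:
    """Build the clone URL (assumed pattern, see the paragraph above)."""
    return f"https://github.com/{org}/{name}"


def clone_all(org: str = ORG, jobs: int = 6) -> None:
    """List the forks of the organization and clone them `jobs` at a time."""
    listing = subprocess.run(
        ["gh", "repo", "list", org, "--fork", "-L", "500"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    urls = [clone_url(name, org) for name in parse_repo_names(listing, org)]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # check=False so that one failed clone does not stop the others.
        list(pool.map(lambda url: subprocess.run(["git", "clone", url], check=False), urls))
```

Calling clone_all() clones into the current working directory and requires gh and git on the PATH, with gh already authenticated.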

Running the bagel-cli on bulk annotated data

The following scripts are used:


  1. (Optional) create a new Python environment with python -m venv my_env.

  2. Activate your Python environment with source ./my_env/bin/activate

  3. Install the dependencies with pip install -r requirements.txt

  4. Get the latest version of the bagel-cli from Docker Hub: docker pull neurobagel/bagelcli:latest

  5. Create a directory called inputs in the repository root that contains all the datasets that will be processed with the CLI.

  6. To run the CLI in parallel across the datasets in inputs/, double check that the directory paths used by and are correct, then run:
