Script to take BIDS dataset and generate MONAI-compatible JSON list of subjects for training

ivadomed / utilities

Repository containing utility scripts for handling various aspects in a DL model training pipeline

MIT License


Script to take BIDS dataset and generate MONAI-compatible JSON list of subjects for training #21

Open jcohenadad opened 1 year ago

jcohenadad commented 1 year ago

Talking with @louisfb01, I realized that the lab does not have a procedure for training MONAI models from a BIDS dataset and instead converts the data physically, which is problematic because:

~it duplicates the data (more space on HD)~ EDIT: actually it does not (see Naga's comment below)
~the dataset used for training is not synced anymore (which defeats the purpose of version-tracking our datasets for keeping provenance in trained models)~ EDIT: again wrong-- see Naga's comment below

I know the MONAI folks have been working on BIDS compatibility. Can people please link all the existing resources in this GH discussion thread, and also discuss strategies for the lab to come up with a unified protocol/script for preparing a JSON file for MONAI training?

The solution should accommodate the aggregation of multiple BIDS datasets.

Some resources:

@naga-karthik 's script to physically convert data to MONAI format
@jcohenadad 's script to train a model with MONAI based on BIDS dataset

naga-karthik commented 1 year ago

Thanks for opening the issue! It seems that there's some misunderstanding about what the conversion scripts are doing.

> it duplicates the data (more space on HD)

No, it does NOT duplicate the data. The output of the MSD conversion script is just a pointer to the original, version-tracked BIDS dataset. This line shows that the output is just a .json file containing the paths to the images and labels of the original BIDS dataset.

> the dataset used for training is not synced anymore

Based on what I wrote above, since the output is a JSON file pointing to the BIDS dataset, the script only takes the latest paths to the BIDS dataset. There is NO duplication of the datasets anywhere.

What do I mean by "pointing to the original BIDS dataset"? Here's a screenshot of how the JSON file looks:

[Screenshot: example MSD JSON]

As you can see, we're referring to images in the root folder of the dataset and the labels in the derivatives folder.
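For readers who cannot see the screenshot, here is a minimal sketch of the idea, not the lab's actual conversion script. It assumes a single-session layout (`sub-*/anat/`), labels under `derivatives/labels/`, and a `_seg` label suffix; the function names `build_datalist` and `write_datalist` and the exact JSON keys are hypothetical choices for illustration.

```python
import json
from pathlib import Path


def build_datalist(bids_root, label_suffix="_seg", derivatives_name="labels"):
    """Collect image/label path pairs from a BIDS dataset without copying any file."""
    root = Path(bids_root)
    pairs = []
    # Assumption: images sit in sub-*/anat/ directly under the dataset root.
    for image in sorted(root.glob("sub-*/anat/*.nii.gz")):
        stem = image.name[: -len(".nii.gz")]
        # Assumption: labels mirror the image tree under derivatives/<derivatives_name>/.
        label = (
            root
            / "derivatives"
            / derivatives_name
            / image.parent.relative_to(root)
            / f"{stem}{label_suffix}.nii.gz"
        )
        # Only keep subjects that actually have a label file.
        if label.exists():
            pairs.append({"image": str(image), "label": str(label)})
    return pairs


def write_datalist(pairs, out_path):
    """Write the pairs to a JSON file; the file holds paths only, never image data."""
    datalist = {"numTraining": len(pairs), "training": pairs}
    Path(out_path).write_text(json.dumps(datalist, indent=2))
    return datalist
```

Because the JSON only stores paths, re-running the script after updating the BIDS dataset picks up the current files. A file with a `training` list of `image`/`label` entries is the kind of input that MONAI's `load_decathlon_datalist` reads, though the exact keys expected should be checked against the MONAI version in use.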

Hope this clarifies some things a bit!

jcohenadad commented 1 year ago

> Hope this clarifies some things a bit!

It does! Thanks a lot @naga-karthik! Your solution is exactly what we need. I would just like to make it more visible to the lab, e.g. create a template script in this repo maybe?

naga-karthik commented 1 year ago

I created something like that here (and the students in the lab do know that the conversion scripts exist).

> create a template script in this repo maybe?

It's pretty hard to create a template script that just works in a plug-and-play manner. The suffixes, contrasts, sessions, etc. differ too much across the kinds of datasets we have, so the script I linked above is just meant to be a starting point. The students would have to look at the code and make small modifications depending on how their data looks (I also make it a bit easier by adding TODOs for where to add stuff).
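One way to picture those per-dataset modifications is to move them into a small configuration object, which also covers the request to aggregate several BIDS datasets. This is only a sketch under assumptions: `DatasetSpec`, `collect_pairs`, and `aggregate` are hypothetical names, and the `derivatives/labels` location and `seg` suffix are placeholders that real datasets may not follow.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class DatasetSpec:
    """Per-dataset settings that would otherwise be edited by hand in the script."""
    root: str
    contrast: str = "T2w"                    # image suffix, e.g. T1w, T2w
    label_suffix: str = "seg"                # assumption: labels end in _<label_suffix>
    labels_dir: str = "derivatives/labels"   # assumption: where labels are stored


def collect_pairs(spec):
    """Return image/label path pairs for one dataset, with or without ses-* folders."""
    root = Path(spec.root)
    pairs = []
    for pattern in ("sub-*/anat", "sub-*/ses-*/anat"):
        for image in sorted(root.glob(f"{pattern}/*_{spec.contrast}.nii.gz")):
            stem = image.name[: -len(".nii.gz")]
            label = (
                root
                / spec.labels_dir
                / image.parent.relative_to(root)
                / f"{stem}_{spec.label_suffix}.nii.gz"
            )
            # Skip images that have no matching label file.
            if label.exists():
                pairs.append({"image": str(image), "label": str(label)})
    return pairs


def aggregate(specs):
    """Merge the pairs of several BIDS datasets into one training list."""
    training = []
    for spec in specs:
        training.extend(collect_pairs(spec))
    return {"numTraining": len(training), "training": training}
```

A call such as `aggregate([DatasetSpec("/data/dataset-a"), DatasetSpec("/data/dataset-b", contrast="T1w")])` (paths made up here) would then return one combined list. Datasets with unusual naming would still need code changes, which is the limitation described above.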

jcohenadad commented 1 year ago

The fact that @louisfb01 started off with @naga-karthik's script (instead of starting from scratch) is evidence that having at least a script to start from is better than no script at all. That justifies putting something in this repo and redirecting students to it (and, importantly, improving that script over time).