sct-pipeline / ukbiobank-spinalcord-csa

Measurement of the averaged cross-sectional area of the spinal cord (cord CSA) between C2 and C3 using the UK Biobank brain MRI dataset.
MIT License

UK - Biobank New BIDS dataset #29

Open mpompolas opened 3 years ago

mpompolas commented 3 years ago

ULTIMATE GOAL - Create a new REPO of UK-BioBank

For the purpose of this new BIDS dataset, we want to keep the final preprocessed files, and the derivatives that correspond to them (a gradient-corrected scan has a different segmentation than the original).

The new BIDS folder should appear as an identical copy of UK-Biobank (same number of files AND same LABELS) but within a different folder name: e.g. UK_BioBank_processed, and also have the derivatives that were manually checked.

BEFORE MANUAL CHECK

Sandrine's pipeline seems ready to go. At this stage, I suggest we keep all the intermediate files for easy identification of potential problems. If space becomes an issue on Joplin, we can reevaluate: maybe process in batches.

AFTER MANUAL CHECK

We should have files within the /UK_BioBank_processed/derivatives folder. Labels should not carry the RPI, gradcorr, etc. suffixes, so in your code, when you add the suffix _manual, make sure you strip those off.

Regarding the anatomical files (not the derivatives), we want to keep only the last file of the preprocessing, with the same name as the original: e.g. instead of sub-1000252_T2w_RPI_r_gradcorr.nii.gz it should be sub-1000252_T2w.nii.gz.

This will make things very easy for later processing through the Ivadomed pipeline. So to sum it up:

  1. Rename the reoriented/resampled file to what the original was,
  2. Delete the rest of the processing files (*_RPI, *_RPI_r_gradcorr, etc.).
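The two steps above could be sketched in Python as follows. This is a hypothetical helper, not the pipeline's actual code, and it assumes the suffix chain is always `_RPI_r_gradcorr` as in the example above:

```python
import glob
import os


def finalize_subject(anat_dir, contrast="T2w"):
    """Keep only the fully preprocessed image, renamed to the original
    BIDS filename, and delete the intermediate processing files.
    Hypothetical sketch: assumes the final file always ends in
    _RPI_r_gradcorr.nii.gz, which may not hold for every subject."""
    # Step 1: sub-XXX_T2w_RPI_r_gradcorr.nii.gz -> sub-XXX_T2w.nii.gz
    pattern = os.path.join(anat_dir, f"*_{contrast}_RPI_r_gradcorr.nii.gz")
    for final in glob.glob(pattern):
        original = final.replace("_RPI_r_gradcorr", "")
        os.replace(final, original)
    # Step 2: delete the remaining intermediates (*_RPI.nii.gz, *_RPI_r.nii.gz, ...)
    for leftover in glob.glob(os.path.join(anat_dir, f"*_{contrast}_RPI*.nii.gz")):
        os.remove(leftover)
```

Because the final file is renamed before the cleanup glob runs, it no longer matches `*_RPI*` and survives the deletion step.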

NOTES

A few more files are needed for a complete BIDS folder: dataset_description.json and participants.json (you only have participants.tsv) - Maybe a README.TXT as well(?). Just copy these from the original UK-BioBank dataset.

The preprocessing steps should be documented somewhere: the easiest place would be in dataset_description.json. Document the git version of SpinalCordToolbox and the function calls that were used, with their parameters. Another option is the .json sidecar associated with each .nii.gz, but that is a bit more work. There is also the gradcorr file that needs to be documented somehow; I don't have any input on that. As a start, maybe document which facility it came from(?)
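One way to record this provenance, sketched under the assumption that the dataset follows the BIDS "GeneratedBy" convention for derived datasets (the version string, description, and field values below are placeholders, not the actual values used):

```python
import json

# Hypothetical sketch of a dataset_description.json for the processed
# dataset. Field names follow the BIDS convention for derived datasets;
# the version and description strings are placeholders.
dataset_description = {
    "Name": "UK_BioBank_processed",
    "BIDSVersion": "1.6.0",
    "DatasetType": "derivative",
    "GeneratedBy": [
        {
            "Name": "SpinalCordToolbox",
            "Version": "<git commit hash here>",
            "Description": "Reorientation to RPI, resampling, "
                           "gradient distortion correction",
        }
    ],
}

with open("dataset_description.json", "w") as f:
    json.dump(dataset_description, f, indent=4)
```

Keeping the SCT git hash in "Version" makes the exact preprocessing reproducible from the dataset alone.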

jcohenadad commented 3 years ago

Thank you for initiating this @mpompolas, a few clarifications:

sandrinebedard commented 3 years ago

> We should have files within the /UK_BioBank_processed/derivatives folder. Labels should not carry the RPI, gradcorr, etc. suffixes, so in your code, when you add the suffix _manual, make sure you strip those off.

@mpompolas So I can add to this branch a modified version of my script for manual corrections manual_correction.py so the output name of manual correction would be for example sub-1000032_T1w_seg-manual.nii.gz instead of sub-1000032_T1w_RPI_r_gradcorr_seg-manual.nii.gz directly, is that right?
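A minimal sketch of that suffix stripping, assuming the preprocessing suffixes are always drawn from `_RPI`, `_r`, and `_gradcorr` (the function name and regex are illustrative, not the actual `manual_correction.py` implementation):

```python
import re


def manual_label_name(fname, label_suffix="_seg-manual"):
    """Strip the preprocessing suffixes (_RPI, _r, _gradcorr) from an
    image filename and append the manual-correction label suffix.
    Hypothetical helper: manual_correction.py may do this differently."""
    base = fname.replace(".nii.gz", "")
    # Drop any trailing run of _RPI, _r, _gradcorr, in any order.
    base = re.sub(r"(_RPI|_r|_gradcorr)+$", "", base)
    return base + label_suffix + ".nii.gz"
```

For example, sub-1000032_T1w_RPI_r_gradcorr.nii.gz would map to sub-1000032_T1w_seg-manual.nii.gz, and a filename with no preprocessing suffixes passes through unchanged apart from the added label suffix.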

mpompolas commented 3 years ago

> the repos should be under git-annex (not on duke); the repo's name should be the same as the original repo (unprocessed) with the added suffix: -processed

Thanks @jcohenadad , just edited my instructions.

> So I can add to this branch a modified version of my script for manual corrections manual_correction.py so the output name of the manual correction would be, for example, sub-1000032_T1w_seg-manual.nii.gz instead of sub-1000032_T1w_RPI_r_gradcorr_seg-manual.nii.gz directly, is that right?

@sandrinebedard exactly. For creating this dataset, we will solely use code from this branch.

sandrinebedard commented 3 years ago

I had some thoughts about the datasets we want to create. We talked about the fact that the derivatives folder would only be in the UK_BioBank_processed dataset. However, my pipeline for cord CSA takes as input the raw images and also the manual segmentations and disc labels in the derivatives. So there will be a problem if the derivatives are in UK_BioBank_processed.

ideas:

@jcohenadad do you have some thoughts on this?

jcohenadad commented 3 years ago

@sandrinebedard good point.

> I could modify my process_data.sh to take in the new dataset, so removing the steps of resampling, reorientation and gradcorr, but we would have to create the dataset before I can run my pipeline

I would lean towards this approach. You could e.g. break down your shell script and create a preprocess_data.sh, which deals with gradcorr and resampling. That script could also deal with renaming (i.e. remove the suffix "_gradcorr_r" as we discussed), so that the output data is "clean" of suffixes and can be used as a "native" BIDS dataset for other projects (e.g. model training).

> Would it be possible to have the same derivatives folder associated with both datasets or something like that?

I would advise against it. I'm afraid we would end up with out-of-sync derivatives (e.g. a segmentation manually corrected in dataset1 but we forget to update it in dataset2).

mpompolas commented 3 years ago

> You could e.g. break down your shell script and create a preprocess_data.sh, which deals with gradcorr and resampling. That script could also deal with renaming (i.e. remove the suffix "_gradcorr_r" as we discussed), so that the output data is "clean" of suffixes and can be used as a "native" BIDS dataset for other projects (e.g. model training).

I agree with @jcohenadad on splitting the script into two parts.

> Would it be possible to have the same derivatives folder associated with both datasets or something like that?

The idea is to completely separate the original from the preprocessed dataset. If we put segmentations from multiple datasets within the same folder (I assume you would differentiate them with a suffix), it will become complicated later on to identify which ones to use for training, since we tend to use a standardized suffix across all datasets ("_seg-manual", "_labels-disk-manual", etc.). This standardization makes it very easy to select multiple datasets/BIDS folders as inputs for training.

sandrinebedard commented 3 years ago

> I would lean towards this approach. You could e.g. break down your shell script and create a preprocess_data.sh, which deals with gradcorr and resampling. That script could also deal with renaming (i.e. remove the suffix "_gradcorr_r" as we discussed), so that the output data is "clean" of suffixes and can be used as a "native" BIDS dataset for other projects (e.g. model training).

@jcohenadad @mpompolas I agree, splitting the script seems like the best idea, I will get into it!