neuropoly / data-management

Repo that deals with datalad aspects for internal use
4 stars 0 forks source link

Standardizing Neuropoly datasets (One more step towards BIDS) #282

Open NathanMolinier opened 7 months ago

NathanMolinier commented 7 months ago

Description

This issue will be used to centralize the main discussions regarding our standardization strategies. The ultimate objective is to enable the execution of BIDS validator scripts within the derivatives folders of each dataset in our collection.

Dataset specific issues related to their standardization should be linked to this issue as well.

BIDS conventions

1. Important links

BIDS specifications Neuropoly intranet

2. Important points regarding the raw dataset for MRI

[Brackets] are characterizing optional informations

As you may know, subjects folders in the raw datasets are structured as follows for MRI,

Raw structure ``` sub-

With folders corresponding to subjects, [sessions] and MRI modalities.

Then, regarding the way filenames are constructed, we can identify 3 main types of elements:

Raw entities Characterized by a key word (sub, ses, acq, etc.) and a value (label = an alphanumeric value, index = a nonnegative integer, etc) separated with a dash `-` - `sub-
Raw suffixes An alphanumeric string located after all the entities following a final underscore `_` (i.e. the ``) --> corresponding in our cases to the MRI contrast: - `T1w` - `MP2RAGE` - `dwi` - etc.
Raw extensions Files extensions: - `.nii.gz` - `.json` - `.bval` - etc.

3. Important points regarding derivatives datasets

First, it is important to understand what are BIDS derivatives folders:

"Derivatives are outputs of common processing pipelines, capturing data and meta-data sufficient for a researcher to understand and (critically) reuse those outputs in subsequent processing. Standardizing derivatives is motivated by use cases where formalized machine-readable access to processed data enables higher level processing."

Basically, derivative folders have to be seen as processed data obtained from the raw dataset. With a derivative folder corresponding to DIFFERENT processes. Let's consider a fundamental workflow that aligns with the receipt of a new dataset. Let's imagine that we would like to generate spinal cord (SC) segmentions and intervertebral discs labels from this raw data using sct_deepseg_sc and sct_label_vertebrae.

Since the wanted outputs will be generated using TWO different algorithms, we will have to create TWO different folders under derivatives (e.g. derivatives/labels_deepseg and derivatives/labels_vertebrae). Moreover, a dataset_description.json file will have to be added at the root of EACH derivative folder to keep track of all the operations applied to the data (processes used, manual corrections...).

dataset_description.json ```json { "BIDSVersion": "1.8.0", "Name": "", "DatasetType": "derivative", "GeneratedBy": [ { "Name": "sct_deepseg_sc", "Version": "SCT v6.1" }, { "Name": "Manual", "Description": "Manually corrected by Nathan Molinier" } ] } ```

To keep a record of complex processing steps applied to the data, a descriptions.tsv file can be used.

descriptions.tsv This file must be composed at least of two columns: - `desc_id`: labels corresponding to the desc entities (see Derivative entities below) - `description`: human readable descriptions of the processing steps This file MAY be located at the root of the derivative dataset, or at the subject or session level

Then, derivative folders follow the same structure as the raw folders:

Derivatives structure ``` sub-

With folders corresponding to subjects, [sessions] and MRI modalities.

Finally, to construct the filenames, we can identify the same 3 type of elements as before (entities, suffixes and extensions) plus 1 extra-consideration related to the raw data:

⚠️ Entities and suffixes are different from those used with the raw filenames and are specific to data types.

source_entities This element corresponds to the entire source filename, with the omission of the source suffix and extension. > For MRI we need to remove the contrast from the filename (see [here](https://bids-specification.readthedocs.io/en/stable/derivatives/introduction.html#file-naming-conventions)).
Derivative entities Characterized by a key word (space, res, den, etc.) and a value (label = an alphanumeric value, index = a nonnegative integer, etc) separated with a dash `-` - `[space-]`: image space if different from raw space: template space (i.e. MNI305 etc), individual, study etc. (see [BIDS](https://bids-specification.readthedocs.io/en/stable/appendices/coordinate-systems.html) for allowed spaces) - `[res-
Derivative suffixes An alphanumeric string located after all the entities following a final underscore `_` : - `mask` for binary masks (0 and 1 only) - `dseg` for discrete segmentations representing multiple anatomical structures - `probseg` for probabilistic segmentations representing a single anatomical structure with values ranging from 0 to 1 - etc.
Derivatives extensions Files extensions: - `.nii.gz` - `.json` - etc.

Neuropoly strategy

To overcome this standardization challenge, multiple steps have to be considered:

Neuropoly Dataset state 2023-11-13 slides

MRI Datasets

Micro Datasets

EEG Datasets

naga-karthik commented 7 months ago

Hey @NathanMolinier! I have to say that the issue description summarizes the BIDS format quite well (much much better than the original documentation, which is hard to understand). Great job!

One minor clarification: Are you sure that the the GeneratedBy in dataset_description.json is correct? I mean, I thought that the Name refers to the person or the model that did the segmentations -- so in this case, it would be

{
      "Name": "Nathan Molinier",
      "Description": "Manually correction of deepseg_sc predictions"
    }

Also, a very important key in this is the Date, which should be added for all manual and automatic predictions. So, overall, I think the dataset_description should have the following structure:

{
  "BIDSVersion": "1.8.0",
  "Name": "<dataset_name>",
  "DatasetType": "derivative",
  "GeneratedBy": [
    {
      "Name": "sct_deepseg_sc",
      "Description": "SCT v6.1"
      "Date": "202X-mm-dd HH:MM:SS"
    },
    {
      "Name": "Nathan Molinier",
      "Description": "Manually correction of deepseg_sc predictions"
      "Date": "202X-mm-dd HH:MM:SS"
    }
  ]
}
NathanMolinier commented 7 months ago

Hey @naga-karthik ! Thanks for your comment ! Regarding the .json files, there are few considerations that need to be taken into account. We can identify actually 2 types of .json:

The date field that you added is here not relevant because it could be different for all the images in the derivative folder.

mguaypaq commented 7 months ago

Nathan's description of the difference is good; basically we can put all the information we want in the json sidecars, but for dataset_description.json the GeneratedBy key has a more rigid schema. According to this part of the spec, the allowed fields are:

valosekj commented 7 months ago

Thanks, @NathanMolinier, for this initiative! The description is clear and makes sense! 👍🏻

Basically, derivative folders have to be seen as processed data obtained from the raw dataset. With a derivative folder corresponding to DIFFERENT processes

Since the wanted outputs will be generated using TWO different algorithms, we will have to create TWO different folders under derivatives (e.g. derivatives/labels_deepseg and derivatives/labels_vertebrae).

We usually call many different algorithms within a single pipeline (e.g., process_data.sh bash script), for example, sct_deepseg_sc, sct_deepseg_gm, sct_label_vertebrae, etc. Should we then have multiple derivatives folders (one for each algorithm) as you are proposing or just a single derivatives folder for the whole pipeline (named based on the bash script, e.g., derivatives/process_data)?

Using multiple derivatives folders would be a big change causing backward incompatibility with all our current processing scripts which try to locate labels under derivatives-labels; example here.

Also, for the retrospective rearrangement of the existing datasets, we do not always know if SC seg was obtained using sct_deepseg_sc or sct_propseg or completely manually. How would we name the derivatives folder then?

NathanMolinier commented 7 months ago

Thanks @valosekj ! Great questions !

We usually call many different algorithms within a single pipeline (e.g., process_data.sh bash script), for example, sct_deepseg_sc, sct_deepseg_gm, sct_label_vertebrae, etc. Should we then have multiple derivatives folders (one for each algorithm) as you are proposing or just a single derivatives folder for the whole pipeline (named based on the bash script, e.g., derivatives/process_data)?

For this workflow, all the files generated using this pipeline can be kept inside the same derivative folder. In this case we will need to add the different functions computed inside the dataset_description.json file and possibly in a descriptions.tsv file:

dataset_description.json ```json { "BIDSVersion": "1.9.0", "Name": "", "DatasetType": "derivative", "GeneratedBy": [ { "Name": "sct_deepseg_sc", "Version": "SCT v6.1" }, { "Name": "sct_deepseg_gm", "Version": "SCT v6.1" }, { "Name": "sct_label_vertebrae", "Version": "SCT v6.1" }, { "Name": "Manual", "Description": "Manually corrected by Nathan Molinier" } ] } ```

Regarding the name of the folder, I believe there is nothing covering that. However, I think we should agree on a naming pattern for the lab datasets.

Using multiple derivatives folders would be a big change causing backward incompatibility with all our current processing scripts which try to locate labels under derivatives-labels; example here.

I understand that this point is a bit problematic. However, if you look at all the codes created in the lab, I believe we can identify more ways to import the data than the number of people currently working in the lab. Having a different way to import the data depending on the dataset is not a good thing. I truly believe that standardizing could help us finding the "good" behavior to use/import our data.

Also, for the retrospective rearrangement of the existing datasets, we do not always know if SC seg was obtained using sct_deepseg_sc or sct_propseg or completely manually. How would we name the derivatives folder then?

Thanks to BIDS standard, this information should be located also in the dataset_description.json file. If we cannot remember retrospectively, I believe we can specify it in a descriptions.tsv file.

mguaypaq commented 7 months ago

The items in the GeneratedBy list in the derivative's dataset_description.json should be pipelines rather than commands, I think. Since the version of the pipeline is given, that should be enough to look at the pipeline code and see which SCT commands (and versions) were used.

NathanMolinier commented 7 months ago

The dataset curation section on our intranet was updated. Feel free to give me some feedback on it @mguaypaq @naga-karthik @valosekj @jcohenadad @sandrinebedard @plbenveniste @Nilser3

Nilser3 commented 7 months ago

Hey @NathanMolinier , Thanks you! Maybe add a flag for denoised images (see: the BIDS Preprocessed or cleaned data ) For example, in MP2RAGE datas we sometimes have denoised and non-denoised images. (Discussed here)

jcohenadad commented 7 months ago

@Nilser3 please feel free to add it directly, or to suggest the change via a PR

jcohenadad commented 7 months ago

@NathanMolinier I took the liberty to do this https://github.com/neuropoly/intranet.neuro.polymtl.ca/commit/5666e571d9ecdf594e29371ecca7295ceaf6efa9, and i'm planning to do the same for the rest of the document-- please let me know if you disagree

NathanMolinier commented 7 months ago

@NathanMolinier I took the liberty to do this neuropoly/intranet.neuro.polymtl.ca@5666e57, and i'm planning to do the same for the rest of the document-- please let me know if you disagree

I was hesitant to do that while writing the documentation, but I finally kept the format with details to reduce the overall size of the document. However, I do understand that being able to access specific features using relative links can be interesting. We could also decide to use these titles in combination with the details.

NathanMolinier commented 7 months ago

Hey @NathanMolinier , Thanks you! Maybe add a flag for denoised images (see: the BIDS Preprocessed or cleaned data ) For example, in MP2RAGE datas we sometimes have denoised and non-denoised images. (Discussed here)

Regarding this, I need to figure out how we should handle this case.

jcohenadad commented 7 months ago

We could also decide to use these titles in combination with the details.

You mean, keep the headings (ie: hyperlink) sections, and below the section, use the "details" to make it look shorter? I'm not a big fan of that. It will involve more clicking. Putting myself in the shoes of someone who wants to create a BIDS dataset, they will need to go through all sub-sections, and they can do so easily from the clickable TOC, so why add more clicks?

NathanMolinier commented 7 months ago

Yes it was exactly what I meant. So, it's fine with me if we decide to switch to your version with more headings.

jcohenadad commented 7 months ago

This is a pretty big deal:

https://github.com/neuropoly/intranet.neuro.polymtl.ca/blob/0b1d9580da994af05ef7b40e38996062034fc385/data/dataset-curation.md?plain=1#L341

In the past we decided to keep the contrast and add the derivatives entities (eg: _label-sc_mask). We need to discuss the implications of changing that, because it will impact pretty much ALL our processing pipelines. Tagging @sandrinebedard @valosekj @naga-karthik @plbenveniste @Nilser3 @mguaypaq

jcohenadad commented 7 months ago

About this:

https://github.com/neuropoly/intranet.neuro.polymtl.ca/blob/0b1d9580da994af05ef7b40e38996062034fc385/data/dataset-curation.md?plain=1#L363

In the past, I vaguely remember we discussed this and agreed to use mask for soft masks as well, and do the distinction in the derivative/ child folder (eg: derivatives/labels_soft/). Tagging @mguaypaq @sandrinebedard @naga-karthik

We also need to keep in mind that the BIDS-derivative standard is changing quite fast, so it is possible that the "standard" will change in 1y from now (hence the pragmatic decision to also allow our internal rules to prevail)

jcohenadad commented 7 months ago

Is this relevant to us? https://github.com/neuropoly/intranet.neuro.polymtl.ca/blob/42f6a5eeae3f7c3925e0be9fa1b09dec1f6c929e/data/dataset-curation.md?plain=1#L350

jcohenadad commented 7 months ago

@NathanMolinier pls merge into single table: https://github.com/neuropoly/intranet.neuro.polymtl.ca/blob/42f6a5eeae3f7c3925e0be9fa1b09dec1f6c929e/data/dataset-curation.md?plain=1#L361-L376

jcohenadad commented 7 months ago

I'm not sure we should introduce the distinction between seg and label: https://github.com/neuropoly/intranet.neuro.polymtl.ca/blob/42f6a5eeae3f7c3925e0be9fa1b09dec1f6c929e/data/dataset-curation.md?plain=1#L361-L362

With our use cases, I see more confusion being introduced by future members, which means more trouble for the core team to maintain the consistency of our database

jcohenadad commented 7 months ago

I would not introduce these two:

https://github.com/neuropoly/intranet.neuro.polymtl.ca/blob/42f6a5eeae3f7c3925e0be9fa1b09dec1f6c929e/data/dataset-curation.md?plain=1#L364-L365

Instead, I would use a single suffix _label, as we've done so far.

NathanMolinier commented 6 months ago

Regarding the fact that raw suffixes (contrasts in our case) are removed from the original image names and replaced by the desc- entities in the derived filenames. I personally think that we should comply to this key feature if we:

About this last point, I strongly believe that standards such as BIDS are dependent on trust. Put differently, if individuals opt not to adhere entirely to the fundamental principles of a standard, we will ultimately end up with numerous BIDS-like datasets that are incompatible.

Regarding our old scripts, and the fact that they may not work on these new datasets, it is important to remember that they were created based on specific versions of the datasets and thanks to Git, it is always possible to checkout to an anterior version (commit).

Also, I tried to find related conversations on the BIDS repository and I found this issue where people are talking about an entity mod- which could be used as well to store information about the MRI contrast.

Finally, the practice of excluding the image suffix during the creation of derived data is not a novel concept. This feature was originally described in the initial version of the documentation addressing derivative datasets (v.1.4.0).

jcohenadad commented 6 months ago

https://github.com/neuropoly/data-management/issues/282#issuecomment-1843581535 these are excellent points, @NathanMolinier

sandrinebedard commented 6 months ago

About this:

https://github.com/neuropoly/intranet.neuro.polymtl.ca/blob/0b1d9580da994af05ef7b40e38996062034fc385/data/dataset-curation.md?plain=1#L363

In the past, I vaguely remember we discussed this and agreed to use mask for soft masks as well, and do the distinction in the derivative/ child folder (eg: derivatives/labels_soft/). Tagging @mguaypaq @sandrinebedard @naga-karthik

We also need to keep in mind that the BIDS-derivative standard is changing quite fast, so it is possible that the "standard" will change in 1y from now (hence the pragmatic decision to also allow our internal rules to prevail)

We still added the suffix softseg , here is an example: sub-amu01_T2w_softseg.nii.gz that would be located under labels_softseg

NathanMolinier commented 6 months ago

Beyond thinking between _T1w and _mod-T1w, we need to make sure that we understand the scope of the changes. Indeed, we need first to clarify the purpose of our derivatives datasets to choose the most suited option for our research.

Currently, most of our datasets are private and are only used for our different projects mainly to train and test AI models. However, for reproducibility, all the scripts used for these projects are available on github. Therefore, changing the way we standardize/use our data will impact directly these scripts:

Workaround 1:

Workaround 2: see mguaypaq's slides

Workaround 3:

If we decide to go with workaround 1, we will need to clarify within all our scripts that we are not complying to BIDS derivatives format and thus people will have to modify our scripts or transform their data to comply with our own custom format. From my point of view, this will make our scripts less likely to be used and scripts like manual correction will only be used for our internal usage.

If we decide to go with workaround 3, BIDS users would be more likely interested in our scripts. Regarding entities and suffixes, having custom ones will not be as impactful because this could be easily mitigated using special parser flags.

I’m not saying that BIDS is perfect (I personally think that suffixes should be avoided) but I think that complying to a standard could positively impact the scope of our research work. Because having our own custom standard will only generate more barriers with other researchers.

jcohenadad commented 6 months ago

Thank you for your insights @NathanMolinier . These are solid arguments. For the sake of comprehensiveness, I will copy/paste the minutes from our previous meeting:

About the derivative suffix (ie. Workaround 1 vs. 2), here are some additional pros/cons:

About the use of probseg: We decided to use “softseg”. Related commit (that could still be updated) https://github.com/neuropoly/intranet.neuro.polymtl.ca/commit/03ca9af8753dd11795627f242350b380f4e5b890

About “blabel”, “dlabel”: We decided to not use them. These are too confusing and unnecessary.

Some Examples:

jcohenadad commented 6 months ago

cross ref to gslide explaining additional BIDS context

jcohenadad commented 6 months ago

another meeting on 2023-12-12-- trying to find consensual decision about what to do:

About the derivative suffix (ie. Workaround 1 vs. 2 vs. 3):

About the use of probseg: We opt for “softseg”.

About “blabel”, “dlabel”: We decided to not use them. These are too confusing and unnecessary.

Some Examples:

https://github.com/neuropoly/intranet.neuro.polymtl.ca/blob/master/data/dataset-curation.md#examples-and-use-cases

(someone please continue that list)

todo:

NathanMolinier commented 5 months ago

One last question about our convention:

BIDS says that Derivative data obtained using DIFFERENT processes/workflows should be stored using DIFFERENT derivatives folders, example:

The idea is to use the dataset_description.json to track the processes use to generate the data.

However, I believe we mentioned that we maybe would like to keep only one folder under derivatives ?


Solution 1: We use multiple folders to track the processes used to create the data.

Solution 2: We decide to regroup all the files under a same derivative dataset called labels as we currently do.


What do we decide ? I just want to properly update the intranet.

valosekj commented 5 months ago

However, I believe we mentioned that we maybe would like to keep only one folder under derivatives ?

Yeah, I think so too. I.e., "Solution 2":

Solution 2: We decide to regroup all the files under a same derivative dataset called labels as we currently do.

I believe this solution is also better for bash scripts where we check if a label already exists under derivatives (example here). It would be more complicated to handle multiple different derivatives/labels/ folders.

NathanMolinier commented 5 months ago

Solution 2: We decide to regroup all the files under a same derivative dataset called labels as we currently do.

I believe this solution is also better for bash scripts where we check if a label already exists under derivatives (example here). It would be more complicated to handle multiple different derivatives/labels/ folders.

It also means that we will have to remove the derivative folder labels_softseg from data-multi-subject.

jcohenadad commented 5 months ago

I would say a mix between solution 1 and solution 2. For most datasets, we only have binary labels. So let's call all of them 'labels' for the reason raised by @valosekj (easier to crawl inside multiple datasets).

on the other hand, in some cases, we have software specific, or different types of labels, that we would like to distinguish from (eg: labels_softseg). In those cases, I see no alternative than doing a subfolder per set of labels.

sandrinebedard commented 5 months ago

I would go with Julien's approach!

I would say a mix between solution 1 and solution 2. For most datasets, we only have binary labels. So let's call all of them 'labels' for the reason raised by @valosekj (easier to crawl inside multiple datasets).

on the other hand, in some cases, we have software specific, or different types of labels, that we would like to distinguish from (eg: labels_softseg). In those cases, I see no alternative than doing a subfolder per set of labels.

For the labels_softseg in the spine generic multi-subject dataset, this is a special case that these segmentation are an average of 6 contrasts (so not the output of a segmentation model but using the binary segmentation in labels + averaging ).

naga-karthik commented 5 months ago

IIUC, soft segmentations such as sub-001_T1w_label-SC_softseg.nii.gz will be under labels_softseg but binary segmentations such as sub-001_T1w_label-SC.nii.gz will be under labels ?

These file name examples are taken from https://github.com/neuropoly/data-management/issues/282#issuecomment-1852137510

NathanMolinier commented 5 months ago

For the labels_softseg in the spine generic multi-subject dataset, this is a special case that these segmentation are an average of 6 contrasts (so not the output of a segmentation model but using the binary segmentation in labels + averaging ).

I understand completely, that's why I believe that solution 1 should not be set aside. The contrast agnostic project will not be the last project asking for a particular data.

In fact, having special data for a specific project is not new: the spinegeneric dataset and canproco do contain labels generated in a different image space that the one used for images, however the data still is currently stored under labels and this can be confusing.

jcohenadad commented 5 months ago

the spinegeneric dataset and canproco do contain labels generated in a different image space that the one used for images, however the data still is currently stored under labels and this can be confusing.

isn't it what the JSON sidecar is supposed to address?

About the canproco: i'm surprised it contains labels that are not in the same voxel space-- @plbenveniste do you have more details about this?

valosekj commented 5 months ago

About the canproco: i'm surprised it contains labels that are not in the same voxel space-- @plbenveniste do you have more details about this?

A note from canproco/README.md:

Note: T2w sagittal images were preprocessed before the spinal cord was segmented ('_T2w_seg-manual.nii.gz') and disc labels were identified ('_T2w_labels-manual.nii.gz'). Namely, reorientation to RPI and resampling to 0.8mm isotropic voxel were performed. This is why the original T2w images have different dimensions than segmentations and labels. Preprocessing steps: https://github.com/ivadomed/canproco/blob/8e1b2c35f96eeeb3838b512dd93eba25e5a5e97a/scripts-t2w_csa/sct-preprocess_data.sh#L162-L170 For context, see https://github.com/neuropoly/data-management/issues/197

This happened because I got inspiration from the spine-generic analysis pipeline.

NathanMolinier commented 5 months ago

the spinegeneric dataset and canproco do contain labels generated in a different image space that the one used for images, however the data still is currently stored under labels and this can be confusing.

isn't it what the JSON sidecar is supposed to address?

Yes also, but people could miss this information from the JSON sidecars, from a human perspective having separate folder may be more visual.

Also, I'm not saying that I'm completely against having all the files within the same derivative folder, but I do think that we should not completely avoid solution 1. This could be helpful for other projects like the contrast-agnostic SC segmentation.

jcohenadad commented 5 months ago

Note: T2w sagittal images were preprocessed before the spinal cord was segmented ('_T2w_seg-manual.nii.gz') and disc labels were identified ('_T2w_labels-manual.nii.gz'). Namely, reorientation to RPI and resampling to 0.8mm isotropic voxel were performed. This is why the original T2w images have different dimensions than segmentations and labels. Preprocessing steps: https://github.com/ivadomed/canproco/blob/8e1b2c35f96eeeb3838b512dd93eba25e5a5e97a/scripts-t2w_csa/sct-preprocess_data.sh#L162-L170 For context, see https://github.com/neuropoly/data-management/issues/197

This happened because I got inspiration from the spine-generic analysis pipeline.

Right, that makes perfect sense. In any case, it looks like we are converging towards a solution where some labels are not in the same space, and the pre-processing (before training) pipeline will take care of put them back into the appropriate space.

NathanMolinier commented 5 months ago

Right, that makes perfect sense. In any case, it looks like we are converging towards a solution where some labels are not in the same space, and the pre-processing (before training) pipeline will take care of put them back into the appropriate space.

Labels in a different space that the one used for images will also have a new entity space-XXX associated and a reference within their JSON sidecars