Open NathanMolinier opened 7 months ago
Hey @NathanMolinier! I have to say that the issue description summarizes the BIDS format quite well (much much better than the original documentation, which is hard to understand). Great job!
One minor clarification: Are you sure that the the GeneratedBy
in dataset_description.json
is correct?
I mean, I thought that the Name
refers to the person or the model that did the segmentations -- so in this case, it would be
{
"Name": "Nathan Molinier",
"Description": "Manually correction of deepseg_sc predictions"
}
Also, a very important key in this is the Date
, which should be added for all manual and automatic predictions. So, overall, I think the dataset_description
should have the following structure:
{
"BIDSVersion": "1.8.0",
"Name": "<dataset_name>",
"DatasetType": "derivative",
"GeneratedBy": [
{
"Name": "sct_deepseg_sc",
"Description": "SCT v6.1"
"Date": "202X-mm-dd HH:MM:SS"
},
{
"Name": "Nathan Molinier",
"Description": "Manually correction of deepseg_sc predictions"
"Date": "202X-mm-dd HH:MM:SS"
}
]
}
Hey @naga-karthik ! Thanks for your comment ! Regarding the .json
files, there are few considerations that need to be taken into account. We can identify actually 2 types
of .json
:
.json
sidecars (OPTIONAL) with not a lot of specifications. We could keep using these files to store informations about specific correctors including the date/time.dataset_description.json
at the root of any given derivative folder (MANDATORY). This file is used to track the different processes used to create a given derivative folder. And because the manual corrections could be performed by multiple authors, we have to use the field description to specify correctors. The date field that you added is here not relevant because it could be different for all the images in the derivative folder.
Nathan's description of the difference is good; basically we can put all the information we want in the json sidecars, but for dataset_description.json
the GeneratedBy
key has a more rigid schema. According to this part of the spec, the allowed fields are:
Name
(mandatory): Name of the pipeline or process that generated the outputs, or Manual
.Description
: Plain-text description of the pipeline or process that generated the outputs.CodeURL
: URL where the code used to generate the dataset may be found.Version
: Version of the pipelineContainer
: A sub-object with keys Type
, Tag
, URI
, to identify a docker container.Thanks, @NathanMolinier, for this initiative! The description is clear and makes sense! 👍🏻
Basically, derivative folders have to be seen as processed data obtained from the raw dataset. With a derivative folder corresponding to DIFFERENT processes
Since the wanted outputs will be generated using TWO different algorithms, we will have to create TWO different folders under derivatives (e.g. derivatives/labels_deepseg and derivatives/labels_vertebrae).
We usually call many different algorithms within a single pipeline (e.g., process_data.sh bash script), for example, sct_deepseg_sc
, sct_deepseg_gm
, sct_label_vertebrae
, etc. Should we then have multiple derivatives
folders (one for each algorithm) as you are proposing or just a single derivatives
folder for the whole pipeline (named based on the bash script, e.g., derivatives/process_data
)?
Using multiple derivatives
folders would be a big change causing backward incompatibility with all our current processing scripts which try to locate labels under derivatives-labels
; example here.
Also, for the retrospective rearrangement of the existing datasets, we do not always know if SC seg was obtained using sct_deepseg_sc
or sct_propseg
or completely manually. How would we name the derivatives
folder then?
Thanks @valosekj ! Great questions !
We usually call many different algorithms within a single pipeline (e.g., process_data.sh bash script), for example, sct_deepseg_sc, sct_deepseg_gm, sct_label_vertebrae, etc. Should we then have multiple derivatives folders (one for each algorithm) as you are proposing or just a single derivatives folder for the whole pipeline (named based on the bash script, e.g., derivatives/process_data)?
For this workflow, all the files generated using this pipeline can be kept inside the same derivative folder. In this case we will need to add the different functions computed inside the dataset_description.json
file and possibly in a descriptions.tsv
file:
Regarding the name of the folder, I believe there is nothing covering that. However, I think we should agree on a naming pattern for the lab datasets.
Using multiple derivatives folders would be a big change causing backward incompatibility with all our current processing scripts which try to locate labels under derivatives-labels; example here.
I understand that this point is a bit problematic. However, if you look at all the codes created in the lab, I believe we can identify more ways to import the data than the number of people currently working in the lab. Having a different way to import the data depending on the dataset is not a good thing. I truly believe that standardizing could help us finding the "good" behavior to use/import our data.
Also, for the retrospective rearrangement of the existing datasets, we do not always know if SC seg was obtained using sct_deepseg_sc or sct_propseg or completely manually. How would we name the derivatives folder then?
Thanks to BIDS standard, this information should be located also in the dataset_description.json
file. If we cannot remember retrospectively, I believe we can specify it in a descriptions.tsv
file.
The items in the GeneratedBy
list in the derivative's dataset_description.json
should be pipelines rather than commands, I think. Since the version of the pipeline is given, that should be enough to look at the pipeline code and see which SCT commands (and versions) were used.
The dataset curation section on our intranet was updated. Feel free to give me some feedback on it @mguaypaq @naga-karthik @valosekj @jcohenadad @sandrinebedard @plbenveniste @Nilser3
Hey @NathanMolinier , Thanks you!
Maybe add a flag for denoised
images (see: the BIDS Preprocessed or cleaned data )
For example, in MP2RAGE datas we sometimes have denoised and non-denoised images. (Discussed here)
@Nilser3 please feel free to add it directly, or to suggest the change via a PR
@NathanMolinier I took the liberty to do this https://github.com/neuropoly/intranet.neuro.polymtl.ca/commit/5666e571d9ecdf594e29371ecca7295ceaf6efa9, and i'm planning to do the same for the rest of the document-- please let me know if you disagree
@NathanMolinier I took the liberty to do this neuropoly/intranet.neuro.polymtl.ca@5666e57, and i'm planning to do the same for the rest of the document-- please let me know if you disagree
I was hesitant to do that while writing the documentation, but I finally kept the format with details to reduce the overall size of the document. However, I do understand that being able to access specific features using relative links can be interesting. We could also decide to use these titles in combination with the details.
Hey @NathanMolinier , Thanks you! Maybe add a flag for
denoised
images (see: the BIDS Preprocessed or cleaned data ) For example, in MP2RAGE datas we sometimes have denoised and non-denoised images. (Discussed here)
Regarding this, I need to figure out how we should handle this case.
We could also decide to use these titles in combination with the details.
You mean, keep the headings (ie: hyperlink) sections, and below the section, use the "details" to make it look shorter? I'm not a big fan of that. It will involve more clicking. Putting myself in the shoes of someone who wants to create a BIDS dataset, they will need to go through all sub-sections, and they can do so easily from the clickable TOC, so why add more clicks?
Yes it was exactly what I meant. So, it's fine with me if we decide to switch to your version with more headings.
This is a pretty big deal:
In the past we decided to keep the contrast and add the derivatives entities (eg: _label-sc_mask
). We need to discuss the implications of changing that, because it will impact pretty much ALL our processing pipelines. Tagging @sandrinebedard @valosekj @naga-karthik @plbenveniste @Nilser3 @mguaypaq
About this:
In the past, I vaguely remember we discussed this and agreed to use mask
for soft masks as well, and do the distinction in the derivative/ child folder (eg: derivatives/labels_soft/
). Tagging @mguaypaq @sandrinebedard @naga-karthik
We also need to keep in mind that the BIDS-derivative standard is changing quite fast, so it is possible that the "standard" will change in 1y from now (hence the pragmatic decision to also allow our internal rules to prevail)
@NathanMolinier pls merge into single table: https://github.com/neuropoly/intranet.neuro.polymtl.ca/blob/42f6a5eeae3f7c3925e0be9fa1b09dec1f6c929e/data/dataset-curation.md?plain=1#L361-L376
I'm not sure we should introduce the distinction between seg and label: https://github.com/neuropoly/intranet.neuro.polymtl.ca/blob/42f6a5eeae3f7c3925e0be9fa1b09dec1f6c929e/data/dataset-curation.md?plain=1#L361-L362
With our use cases, I see more confusion being introduced by future members, which means more trouble for the core team to maintain the consistency of our database
I would not introduce these two:
Instead, I would use a single suffix _label
, as we've done so far.
Regarding the fact that raw suffixes (contrasts in our case) are removed from the original image names and replaced by the desc-
entities in the derived filenames. I personally think that we should comply to this key feature if we:
About this last point, I strongly believe that standards such as BIDS are dependent on trust. Put differently, if individuals opt not to adhere entirely to the fundamental principles of a standard, we will ultimately end up with numerous BIDS-like datasets that are incompatible.
Regarding our old scripts, and the fact that they may not work on these new datasets, it is important to remember that they were created based on specific versions of the datasets and thanks to Git, it is always possible to checkout to an anterior version (commit).
Also, I tried to find related conversations on the BIDS repository and I found this issue where people are talking about an entity mod-
which could be used as well to store information about the MRI contrast.
Finally, the practice of excluding the image suffix during the creation of derived data is not a novel concept. This feature was originally described in the initial version of the documentation addressing derivative datasets (v.1.4.0).
https://github.com/neuropoly/data-management/issues/282#issuecomment-1843581535 these are excellent points, @NathanMolinier
About this:
In the past, I vaguely remember we discussed this and agreed to use
mask
for soft masks as well, and do the distinction in the derivative/ child folder (eg:derivatives/labels_soft/
). Tagging @mguaypaq @sandrinebedard @naga-karthikWe also need to keep in mind that the BIDS-derivative standard is changing quite fast, so it is possible that the "standard" will change in 1y from now (hence the pragmatic decision to also allow our internal rules to prevail)
We still added the suffix softseg
, here is an example: sub-amu01_T2w_softseg.nii.gz
that would be located under labels_softseg
Beyond thinking between _T1w
and _mod-T1w
, we need to make sure that we understand the scope of the changes. Indeed, we need first to clarify the purpose of our derivatives datasets to choose the most suited option for our research.
Currently, most of our datasets are private and are only used for our different projects mainly to train and test AI models. However, for reproducibility, all the scripts used for these projects are available on github. Therefore, changing the way we standardize/use our data will impact directly these scripts:
Workaround 1:
_T1w
Workaround 2: see mguaypaq's slides
_T1w
Workaround 3:
_mod-T1w
If we decide to go with workaround 1, we will need to clarify within all our scripts that we are not complying to BIDS derivatives format and thus people will have to modify our scripts or transform their data to comply with our own custom format. From my point of view, this will make our scripts less likely to be used and scripts like manual correction will only be used for our internal usage.
If we decide to go with workaround 3, BIDS users would be more likely interested in our scripts. Regarding entities and suffixes, having custom ones will not be as impactful because this could be easily mitigated using special parser flags.
I’m not saying that BIDS is perfect (I personally think that suffixes should be avoided) but I think that complying to a standard could positively impact the scope of our research work. Because having our own custom standard will only generate more barriers with other researchers.
Thank you for your insights @NathanMolinier . These are solid arguments. For the sake of comprehensiveness, I will copy/paste the minutes from our previous meeting:
About the derivative suffix (ie. Workaround 1 vs. 2), here are some additional pros/cons:
About the use of probseg
: We decided to use “softseg”. Related commit (that could still be updated) https://github.com/neuropoly/intranet.neuro.polymtl.ca/commit/03ca9af8753dd11795627f242350b380f4e5b890
About “blabel”, “dlabel”: We decided to not use them. These are too confusing and unnecessary.
Some Examples:
cross ref to gslide explaining additional BIDS context
another meeting on 2023-12-12-- trying to find consensual decision about what to do:
About the derivative suffix (ie. Workaround 1 vs. 2 vs. 3):
About the use of probseg: We opt for “softseg”.
About “blabel”, “dlabel”: We decided to not use them. These are too confusing and unnecessary.
Some Examples:
(someone please continue that list)
todo:
One last question about our convention:
BIDS says that Derivative data obtained using DIFFERENT processes/workflows should be stored using DIFFERENT derivatives folders, example:
The idea is to use the dataset_description.json
to track the processes use to generate the data.
However, I believe we mentioned that we maybe would like to keep only one folder under derivatives ?
Solution 1: We use multiple folders to track the processes used to create the data.
Solution 2: We decide to regroup all the files under a same derivative dataset called labels
as we currently do.
What do we decide ? I just want to properly update the intranet.
However, I believe we mentioned that we maybe would like to keep only one folder under derivatives ?
Yeah, I think so too. I.e., "Solution 2":
Solution 2: We decide to regroup all the files under a same derivative dataset called
labels
as we currently do.
I believe this solution is also better for bash scripts where we check if a label already exists under derivatives
(example here). It would be more complicated to handle multiple different derivatives/labels/
folders.
Solution 2: We decide to regroup all the files under a same derivative dataset called
labels
as we currently do.I believe this solution is also better for bash scripts where we check if a label already exists under
derivatives
(example here). It would be more complicated to handle multiple differentderivatives/labels/
folders.
It also means that we will have to remove the derivative folder labels_softseg
from data-multi-subject.
I would say a mix between solution 1 and solution 2. For most datasets, we only have binary labels. So let's call all of them 'labels' for the reason raised by @valosekj (easier to crawl inside multiple datasets).
on the other hand, in some cases, we have software specific, or different types of labels, that we would like to distinguish from (eg: labels_softseg). In those cases, I see no alternative than doing a subfolder per set of labels.
I would go with Julien's approach!
I would say a mix between solution 1 and solution 2. For most datasets, we only have binary labels. So let's call all of them 'labels' for the reason raised by @valosekj (easier to crawl inside multiple datasets).
on the other hand, in some cases, we have software specific, or different types of labels, that we would like to distinguish from (eg: labels_softseg). In those cases, I see no alternative than doing a subfolder per set of labels.
For the labels_softseg
in the spine generic multi-subject dataset, this is a special case that these segmentation are an average of 6 contrasts (so not the output of a segmentation model but using the binary segmentation in labels
+ averaging ).
IIUC, soft segmentations such as sub-001_T1w_label-SC_softseg.nii.gz
will be under labels_softseg
but binary segmentations such as sub-001_T1w_label-SC.nii.gz
will be under labels
?
These file name examples are taken from https://github.com/neuropoly/data-management/issues/282#issuecomment-1852137510
For the
labels_softseg
in the spine generic multi-subject dataset, this is a special case that these segmentation are an average of 6 contrasts (so not the output of a segmentation model but using the binary segmentation inlabels
+ averaging ).
I understand completely, that's why I believe that solution 1 should not be set aside. The contrast agnostic project will not be the last project asking for a particular data.
In fact, having special data for a specific project is not new: the spinegeneric dataset and canproco do contain labels generated in a different image space that the one used for images, however the data still is currently stored under labels
and this can be confusing.
the spinegeneric dataset and canproco do contain labels generated in a different image space that the one used for images, however the data still is currently stored under labels and this can be confusing.
isn't it what the JSON sidecar is supposed to address?
About the canproco: i'm surprised it contains labels that are not in the same voxel space-- @plbenveniste do you have more details about this?
About the canproco: i'm surprised it contains labels that are not in the same voxel space-- @plbenveniste do you have more details about this?
A note from canproco/README.md
:
Note: T2w sagittal images were preprocessed before the spinal cord was segmented ('_T2w_seg-manual.nii.gz') and disc labels were identified ('_T2w_labels-manual.nii.gz'). Namely, reorientation to RPI and resampling to 0.8mm isotropic voxel were performed. This is why the original T2w images have different dimensions than segmentations and labels. Preprocessing steps: https://github.com/ivadomed/canproco/blob/8e1b2c35f96eeeb3838b512dd93eba25e5a5e97a/scripts-t2w_csa/sct-preprocess_data.sh#L162-L170 For context, see https://github.com/neuropoly/data-management/issues/197
This happened because I got inspiration from the spine-generic analysis pipeline.
the spinegeneric dataset and canproco do contain labels generated in a different image space that the one used for images, however the data still is currently stored under labels and this can be confusing.
isn't it what the JSON sidecar is supposed to address?
Yes also, but people could miss this information from the JSON sidecars, from a human perspective having separate folder may be more visual.
Also, I'm not saying that I'm completely against having all the files within the same derivative folder, but I do think that we should not completely avoid solution 1. This could be helpful for other projects like the contrast-agnostic SC segmentation.
Note: T2w sagittal images were preprocessed before the spinal cord was segmented ('_T2w_seg-manual.nii.gz') and disc labels were identified ('_T2w_labels-manual.nii.gz'). Namely, reorientation to RPI and resampling to 0.8mm isotropic voxel were performed. This is why the original T2w images have different dimensions than segmentations and labels. Preprocessing steps: https://github.com/ivadomed/canproco/blob/8e1b2c35f96eeeb3838b512dd93eba25e5a5e97a/scripts-t2w_csa/sct-preprocess_data.sh#L162-L170 For context, see https://github.com/neuropoly/data-management/issues/197
This happened because I got inspiration from the spine-generic analysis pipeline.
Right, that makes perfect sense. In any case, it looks like we are converging towards a solution where some labels are not in the same space, and the pre-processing (before training) pipeline will take care of put them back into the appropriate space.
Right, that makes perfect sense. In any case, it looks like we are converging towards a solution where some labels are not in the same space, and the pre-processing (before training) pipeline will take care of put them back into the appropriate space.
Labels in a different space that the one used for images will also have a new entity space-XXX
associated and a reference within their JSON sidecars
Description
This issue will be used to centralize the main discussions regarding our standardization strategies. The ultimate objective is to enable the execution of BIDS validator scripts within the derivatives folders of each dataset in our collection.
BIDS conventions
1. Important links
BIDS specifications Neuropoly intranet
2. Important points regarding the
raw
dataset for MRIAs you may know, subjects folders in the raw datasets are structured as follows for MRI,
Raw structure
``` sub-With folders corresponding to subjects, [sessions] and MRI modalities.
Then, regarding the way filenames are constructed, we can identify 3 main types of elements:
Raw entities
Characterized by a key word (sub, ses, acq, etc.) and a value (label = an alphanumeric value, index = a nonnegative integer, etc) separated with a dash `-` - `sub-Raw suffixes
An alphanumeric string located after all the entities following a final underscore `_` (i.e. the `Raw extensions
Files extensions: - `.nii.gz` - `.json` - `.bval` - etc.3. Important points regarding
derivatives
datasetsFirst, it is important to understand what are BIDS derivatives folders:
"Derivatives are outputs of common processing pipelines, capturing data and meta-data sufficient for a researcher to understand and (critically) reuse those outputs in subsequent processing. Standardizing derivatives is motivated by use cases where formalized machine-readable access to processed data enables higher level processing."
Basically, derivative folders have to be seen as processed data obtained from the raw dataset. With a derivative folder corresponding to DIFFERENT processes. Let's consider a fundamental workflow that aligns with the receipt of a new dataset. Let's imagine that we would like to generate spinal cord (SC) segmentions and intervertebral discs labels from this raw data using
sct_deepseg_sc
andsct_label_vertebrae
.Since the wanted outputs will be generated using TWO different algorithms, we will have to create TWO different folders under derivatives (e.g.
derivatives/labels_deepseg
andderivatives/labels_vertebrae
). Moreover, adataset_description.json
file will have to be added at the root of EACH derivative folder to keep track of all the operations applied to the data (processes used, manual corrections...).dataset_description.json
```json { "BIDSVersion": "1.8.0", "Name": "To keep a record of complex processing steps applied to the data, a descriptions.tsv file can be used.
descriptions.tsv
This file must be composed at least of two columns: - `desc_id`: labels corresponding to the desc entities (see Derivative entities below) - `description`: human readable descriptions of the processing steps This file MAY be located at the root of the derivative dataset, or at the subject or session levelThen, derivative folders follow the same structure as the
raw
folders:Derivatives structure
``` sub-With folders corresponding to subjects, [sessions] and MRI modalities.
Finally, to construct the filenames, we can identify the same 3 type of elements as before (entities, suffixes and extensions) plus 1 extra-consideration related to the raw data:
source_entities
This element corresponds to the entire source filename, with the omission of the source suffix and extension. > For MRI we need to remove the contrast from the filename (see [here](https://bids-specification.readthedocs.io/en/stable/derivatives/introduction.html#file-naming-conventions)).Derivative entities
Characterized by a key word (space, res, den, etc.) and a value (label = an alphanumeric value, index = a nonnegative integer, etc) separated with a dash `-` - `[space-Derivative suffixes
An alphanumeric string located after all the entities following a final underscore `_` : - `mask` for binary masks (0 and 1 only) - `dseg` for discrete segmentations representing multiple anatomical structures - `probseg` for probabilistic segmentations representing a single anatomical structure with values ranging from 0 to 1 - etc.Derivatives extensions
Files extensions: - `.nii.gz` - `.json` - etc.Neuropoly strategy
To overcome this standardization challenge, multiple steps have to be considered:
Neuropoly Dataset state 2023-11-13 slides
MRI Datasets
Micro Datasets
EEG Datasets