neuropoly / data-management

Repo that deals with datalad aspects for internal use

canproco: Confusing file names between manually corrected derivatives and native data #197

Open jcohenadad opened 1 year ago

jcohenadad commented 1 year ago

In version "8ef0446783959b61af044bd5ec7f6e07653fc3df" of the dataset, the labels (SC seg and disc) have the same filename prefix as the image, but they were created after preprocessing, so their dimensions don't match those of the native image. Example:

Image:

sct_image -i sub-cal078/ses-M0/anat/sub-cal078_ses-M0_T2w.nii.gz -header
...
dim     [3, 56, 512, 512, 1, 1, 1, 1]

Label:

sct_image -i derivatives/labels/sub-cal078/ses-M0/anat/sub-cal078_ses-M0_T2w_labels-manual.nii.gz -header 
...
dim     [3, 56, 320, 320, 1, 1, 1, 1]
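
A mismatch like the one above can also be detected without SCT by reading the `dim` field straight from the NIfTI-1 header. The following is a minimal sketch that assumes NIfTI-1 files (`.nii` or `.nii.gz`); the helper names `read_nifti1_dim` and `same_grid` are hypothetical and not part of any existing tool.

```python
import gzip
import struct


def read_nifti1_dim(path):
    """Return the 8-element 'dim' array from a NIfTI-1 header (.nii or .nii.gz)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as f:
        header = f.read(348)  # a NIfTI-1 header is 348 bytes long
    if len(header) < 348:
        raise ValueError(f"{path}: too short to hold a NIfTI-1 header")
    # sizeof_hdr (int32 at offset 0) must be 348; use it to detect endianness.
    for endian in ("<", ">"):
        (sizeof_hdr,) = struct.unpack_from(endian + "i", header, 0)
        if sizeof_hdr == 348:
            # dim is short[8] at byte offset 40.
            return list(struct.unpack_from(endian + "8h", header, 40))
    raise ValueError(f"{path}: not a NIfTI-1 file")


def same_grid(image_path, label_path):
    """True if both files have the same number of dimensions and the same sizes."""
    img_dim = read_nifti1_dim(image_path)
    lab_dim = read_nifti1_dim(label_path)
    ndim = img_dim[0]
    return img_dim[0] == lab_dim[0] and img_dim[1:ndim + 1] == lab_dim[1:ndim + 1]
```

For the two files above, `same_grid(...)` would return `False`, because 512 x 512 differs from 320 x 320 along the second and third dimensions.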

jcohenadad commented 1 year ago

For the UKB, we had a source dataset and a processed dataset. Both were git-annexed. The labels were generated in the processed dataset. This is fine for internal management, because students have equal access to both and can use the processed data, e.g. to train models.

Note that a script to go from native to processed data (usually called preprocess_data.sh) is stored on GitHub.

Problem: In some cases (e.g. spine-generic), we want the segmentations to be available to external researchers who don't have easy access to the processed data. Therefore, they should be archived in the native (public) dataset. This is what we did for spine-generic.

Problem: If the labels are created from the processed data, they have ugly suffixes, e.g. XXX_r_RPI_seg.nii.gz, which we don't want to carry because:

Solution: We change the suffix of the generated labels to be cleaner, e.g. sub-001_T2w_seg-manual.nii.gz, and we add a README under derivatives/labels/ (the name "labels/" could be different) that explains how the labels were created and with which processing script. Ideally, the README should include a link to the processing script, pointing to a fixed GitHub version. Example: README. Linked issue: https://github.com/spine-generic/data-multi-subject/issues/124
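
The renaming step could look something like the sketch below. It assumes that the preprocessing suffixes are exactly `_r` and `_RPI` and that the processed label ends in `_seg`, as in the example filenames above; the function name `clean_label_name` and the `PREPROC_SUFFIXES` tuple are hypothetical and would need to be adapted to the actual pipeline.

```python
# Hypothetical list of suffixes appended by preprocessing; adapt to the real pipeline.
PREPROC_SUFFIXES = ("_r", "_RPI")


def clean_label_name(processed_name, processed_tag="_seg", label_suffix="seg-manual"):
    """Map e.g. 'sub-001_T2w_r_RPI_seg.nii.gz' to 'sub-001_T2w_seg-manual.nii.gz'."""
    ext = ".nii.gz"
    if not processed_name.endswith(ext):
        raise ValueError(f"expected a {ext} file, got {processed_name!r}")
    stem = processed_name[:-len(ext)]
    if not stem.endswith(processed_tag):
        raise ValueError(f"expected a name ending in {processed_tag}{ext}")
    stem = stem[:-len(processed_tag)]
    # Strip preprocessing suffixes from the end until none is left.
    stripped = True
    while stripped:
        stripped = False
        for suffix in PREPROC_SUFFIXES:
            if stem.endswith(suffix):
                stem = stem[:-len(suffix)]
                stripped = True
    return f"{stem}_{label_suffix}{ext}"
```

Because the cleaned name no longer records the preprocessing, the README under derivatives/labels/ is the only place where that information lives, which is why it should link to a fixed version of the processing script.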

Problem: It requires students to generate the processed data if they want to play with it.

Solution: We generate a processed dataset and git-annex it alongside the script that generated it (as was done for spine-generic). --> The processing script could be located under "code/".

TODO: @valosekj add to general doc

TODO: SCENARIO 2: OR who have already done segmentations on the native data and would like to share them. --> find a solution for that.

TODO: SCENARIO 3: preprocessing changes, what to do with the archived segs? re-do them?

TODO: call everything label

TODO: set up a convention for labels (e.g. seg-manual, disc labels; cross-reference the existing issue on spine-generic, etc.) Linked issue: https://github.com/spine-generic/data-multi-subject/issues/95

jcohenadad commented 1 year ago

TODO for canproco: add README with URL to processing script (GH hashtag)

valosekj commented 1 year ago

> TODO: @valosekj add to general doc

Info added in https://github.com/neuropoly/intranet.neuro.polymtl.ca/commit/92c53781ad39620d05c0106fdba569fef31b8a20

> TODO for canproco: add README with URL to processing script (GH hashtag)

Added in https://github.com/neuropoly/data-management/issues/206. Namely, I added the following to data.neuro.polymtl.ca/canproco/README:

Note: T2w sagittal images were preprocessed before the spinal cord was segmented ('_T2w_seg-manual.nii.gz') and disc labels were identified ('_T2w_labels-manual.nii.gz').
Namely, reorientation to RPI and resampling to 0.8mm isotropic voxel were performed. This is why the original T2w images have different dimensions than segmentations and labels.
Preprocessing steps: https://github.com/ivadomed/canproco/blob/8e1b2c35f96eeeb3838b512dd93eba25e5a5e97a/scripts-t2w_csa/sct-preprocess_data.sh#L162-L170
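
For readers without the linked script at hand, the two steps named in the note (reorientation to RPI, then resampling to 0.8 mm isotropic) can be sketched as the SCT commands below. This is not the canproco script itself: the helper `build_preprocess_commands` and the output names with `_RPI` and `_RPI_r` suffixes are assumptions made for illustration, and the linked script may use different options or ordering.

```python
from pathlib import Path


def build_preprocess_commands(image, out_dir, voxel_mm=0.8):
    """Build (but do not run) SCT commands to reorient to RPI and resample isotropically."""
    image = Path(image)
    ext = ".nii.gz"
    if not image.name.endswith(ext):
        raise ValueError(f"expected a {ext} file, got {image.name!r}")
    stem = image.name[:-len(ext)]
    out_dir = Path(out_dir)
    # Hypothetical output names; the real script may name its outputs differently.
    reoriented = out_dir / f"{stem}_RPI{ext}"
    resampled = out_dir / f"{stem}_RPI_r{ext}"
    mm = "x".join([f"{voxel_mm:g}"] * 3)  # e.g. "0.8x0.8x0.8"
    return [
        ["sct_image", "-i", str(image), "-setorient", "RPI", "-o", str(reoriented)],
        ["sct_resample", "-i", str(reoriented), "-mm", mm, "-o", str(resampled)],
    ]
```

With SCT installed, each returned command could be executed with `subprocess.run(cmd, check=True)`. Labels drawn on the resampled output live on that output's grid, which is why their dimensions differ from those of the native T2w image.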