Closed plbenveniste closed 8 months ago
Also relevant: https://github.com/spinalcordtoolbox/deepseg_lesion_models/issues/2
It would be nice to move everything to git-annex -- if it is not already on git-annex, given that part (if not all!) of this dataset is under git-annex/sct-testing-large.
Also tagging @valosekj @naga-karthik who might be able to clarify
Observations from the BIDSification of the dataset
The BIDSification was done using the dataset `data_ms` stored in `duke/projects/ms_seg/seg_paper/data_ms` and the file `dataset.pkl`.
Dataset name selection: `data_ms` is good.
During the transformation of the dataset, I encountered a few issues:
There are a lot of different sites (I grouped them into 28 sites: "amu", "bwh", "chb", "dou", "gle", "kar", "kor", "lyo", "mgh", "mil", "nih", "nyu", "nwu", "oxf", "par", "per", "pol", "ren", "she", "twh", "ubc", "ucl", "ucs", "unf", "unk", "van", "xua", "zur")
There is no information on how the images were acquired
Not all subjects are referenced in the `dataset.pkl` file: therefore, for some subjects we don't know their pathology (which is basically the only information in the pickle file)
For the json sidecars for images, I put the following:
For the json sidecar for masks, I put the following since I don't know how they were generated ("Lesion Segmentation Manual" changes to "Segmentation Manual")
Total number of subjects: 683
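To list which subjects lack pathology info, one could diff the subject folder names against the pickle's keys. A minimal sketch, assuming `dataset.pkl` deserializes to a mapping keyed by subject folder name (this structure is an assumption, not confirmed in this thread):

```python
import pickle
from pathlib import Path

def unreferenced_subjects(pkl_path: str, data_dir: str) -> list:
    """Subjects present as folders under data_dir but absent from the pickle.

    Assumption: the pickle deserializes to a mapping keyed by subject name.
    """
    with open(pkl_path, "rb") as f:
        pathology = pickle.load(f)
    subjects = {p.name for p in Path(data_dir).iterdir() if p.is_dir()}
    return sorted(subjects - set(pathology))
```

This would give the list of subjects whose pathology is unknown.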
@jcohenadad and @naga-karthik, could you give me feedback since you seem to be aware of how the dataset was built?
@plbenveniste before answering your question, can you please address my comment https://github.com/neuropoly/data-management/issues/264#issuecomment-1728492195:
It would be nice to move everything to git-annex -- if it is not already on git-annex, given that part (if not all!) of this dataset is under git-annex/sct-testing-large.
I hope you did not BIDSify a dataset that was already BIDSified and already moved to git-annex
I did look into the datasets on Git-Annex and I didn't see a dataset matching `data-ms`. Afterward, I discussed it with @valosekj, who confirmed that `data-ms` needed to be BIDSified (maybe I misunderstood). However, now I realize that `sct-testing-large` includes a lot (if not all) of the subjects from `data-ms`.
Had `data-ms` been BIDSified already? If so, by who and how? Do we want to have a BIDSified version of `data-ms`? If no, should we remove `data-ms` from duke?
Tagging @naga-karthik as well for input on what had been done
Had `data-ms` been BIDSified already?
yes, partly or entirely, as I said here https://github.com/neuropoly/data-management/issues/264#issuecomment-1728492195
If so, by who
I think Alex Foias and Charley Gros were the ones working on this. For more information about the generation of `sct-testing-large`, the label-based search is useful: https://github.com/neuropoly/data-management/issues?q=is%3Aopen+is%3Aissue+label%3A%22dataset%3A+sct-testing-large%22 (although it does not cover the time before we ported these discussions to GitHub)
and how?
Using scripts. Some of these scripts have been improved/revamped and put here: https://github.com/neuropoly/data-management/tree/master/scripts
I went through the README of the related project and found additional information.
Do we want to have a BIDSified version of `data-ms`?
Yes, but it seems like we already have one, at least in part. We need to make sure that all data from `data-ms` have been BIDSified.
If no, should we remove `data-ms` from duke?
I would say "If yes, should we remove...". And the answer is yes, probably, but we need to sit down and make sure this will not impact the reproducibility of old studies (whose code relies on a specific data structure).
NOTE: This comment contains some important information. Please take the time to read it carefully!
Okay, I have found some evidence that the `data-ms` dataset on `duke` might have been BIDSified.
Since the folder names under `duke/projects/ms_seg/seg_paper/data_ms` seem to match the `data_id` in `participants.tsv` of `sct-testing-large`, it appears that the dataset might have been BIDSified.
For a few subjects that I quickly checked, there are also lesion masks under `derivatives/labels` of `sct-testing-large`. For example, for the `amu_2017-virginie*` set of folders under `duke/projects/ms_seg/seg_paper/data_ms`, we have the following set of folders `sub-amuVirginie0*` (note the different name) under `derivatives`:
So, what does this tell us? --> we need to confirm whether all subjects under `duke/projects/ms_seg/seg_paper/data_ms` have been successfully BIDSified, and then this dataset can be deleted.
@plbenveniste There are two things:
Could you take the `data_id` column from `participants.tsv` of `sct-testing-large` and compare whether these subjects match the subjects existing under `duke`? This should tell us whether the dataset has been successfully BIDSified. If it turns out that it has been BIDSified already, then that would be absolutely great! This dataset is very valuable and can/will be used in several of our projects!
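A minimal sketch of that comparison, assuming `participants.tsv` has a `data_id` column and the duke copy has one folder per subject (both paths are hypothetical):

```python
import csv
from pathlib import Path

def not_yet_bidsified(participants_tsv: str, duke_dir: str) -> set:
    """Duke folder names with no matching data_id in participants.tsv."""
    with open(participants_tsv, newline="") as f:
        data_ids = {row["data_id"] for row in csv.DictReader(f, delimiter="\t")}
    folders = {p.name for p in Path(duke_dir).iterdir() if p.is_dir()}
    return folders - data_ids
```

An empty result would mean every duke subject already appears in the BIDS dataset.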
Thank you both for your input!
After investigating, comparing the subjects in `data-ms` on duke with the `data_id` in `sct-testing-large`, I found that only the following subjects are not included in `sct-testing-large` (14 out of 683 subjects in `data-ms`):
If we decide on adding them to `sct-testing-large`, I am still missing information for the json file.
Should we copy information from similar files from the same site?
yes. Thank you @plbenveniste
@mguaypaq Could you give me writing rights for the `sct-testing-large` dataset please?
I just saw that the json sidecars for `_seg-manual` are:

```json
{
  "Author": "Charley Gros",
  "Label": "seg_manual"
}
```

→ Keeping them this way to match the format of the dataset
Also, the json sidecars for `_lesion-manual` are empty:
→ Keeping them this way to match the format of the dataset
Also noting here: I saw that a lot of subjects in `sct-testing-large` have a GM segmentation file (which was not in `data-ms`).
> @mguaypaq Could you give me writing rights for the `sct-testing-large` dataset please?
@plbenveniste, done, you should now be able to push non-master branches on `sct-testing-large`.
Changes pushed to branch `plb/add_missing_data_ms_subject`.
Changes include:
- the script `add_missing_subject_data_ms.py`
- the updated `participants.tsv` file

Ready for review now @mguaypaq
While I was at it, I marked `data.neuro.polymtl.ca` as dead.
Then I noticed that the new image files were not properly annexed, so I started fixing this. But while doing this, I noticed some strange things, and started digging. In particular, I noticed that the following two files, which should not be the same, are byte-for-byte identical:
`derivatives/labels/sub-rennesMS074/anat/sub-rennesMS074_acq-inf_T2star_lesion-manual.nii.gz`
`derivatives/labels/sub-montpellierLesion007/anat/sub-montpellierLesion007_acq-inf_T2star_lesion-manual.nii.gz`
Looking at this file in FSLeyes, it looks like a normal lesion mask, with several non-zero voxels, so I don't think it's a case of two empty files or two files with a single voxel being the same.
I think I will have to dig a lot more to see what's going on, and which files are affected. But if anyone has ideas about what's going on, it might go faster.
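One way to find every affected file would be to hash all lesion masks and group them by content digest; a minimal sketch (the glob pattern is an assumption about the file layout, and annexed content must be present locally, e.g. via `git annex get`):

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_identical(root: str, pattern: str = "**/*_lesion-manual.nii.gz") -> list:
    """Return groups of byte-for-byte identical files under `root`."""
    by_digest = defaultdict(list)
    for path in sorted(Path(root).glob(pattern)):
        by_digest[hashlib.sha256(path.read_bytes()).hexdigest()].append(path.name)
    return [names for names in by_digest.values() if len(names) > 1]
```

Any group with more than one entry is a suspect pair like the two files above.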
Thank you sooo much for doing these in-depth checks. If there are duplications and/or wrong file names for images or segmentation, this is definitely problematic for our analyses. @plbenveniste would you mind checking this? Thanks!
I think I found out where the problem came from.
I don't know exactly how it happened, but my code successfully copied the images into `sct-testing-large`; after copying, the images were replaced by some images from Montpellier.
I had discussed this with @mguaypaq when I had the following error message for a lot of files after running my code:
> git-annex: git status will show ./derivatives/labels/sub-montpellierLesion007/anat/sub-montpellierLesion007_acq-inf_T2star_lesion-manual.nii.gz to be modified, since content availability has changed and git-annex was unable to update the index. This is only a cosmetic problem affecting git status; git add, git commit, etc won't be affected. To fix the git status display, you can run: git-annex restage
I ran `git-annex restage` to fix it, and I think that's how the images were replaced.
Currently re-doing the modification to look into this git-annex issue.
After carefully redoing the entire process, I looked into the similarity between the Montpellier files and the files I was adding.
I wrote a script to compare each file I added to `sct-testing-large` to the files of the `sub-montpellier` subjects.
The script came up with the following conclusions (stating whether each file is empty or identical to a file from Montpellier):
Interestingly enough, each added file matches the file of exactly one other subject.
Furthermore, interestingly enough, there are exactly 14 Montpellier subjects and exactly 14 `data-ms` subjects missing from `sct-testing-large`.
Here is the information stored in the participants.tsv for the Montpellier subjects:
```
sub-montpellierLesion001  F  unknown  unknown  MS  montpellier_20170112_07  montpellierLesion
sub-montpellierLesion002  F  unknown  unknown  MS  montpellier_20170112_08  montpellierLesion
sub-montpellierLesion003  F  unknown  unknown  MS  montpellier_20170112_13  montpellierLesion
sub-montpellierLesion004  F  unknown  unknown  MS  montpellier_20170112_14  montpellierLesion
sub-montpellierLesion005  F  unknown  unknown  MS  montpellier_20170112_15  montpellierLesion
sub-montpellierLesion006  F  unknown  unknown  MS  montpellier_20170112_17  montpellierLesion
sub-montpellierLesion007  F  unknown  unknown  MS  montpellier_20170112_29  montpellierLesion
sub-montpellierLesion008  M  unknown  unknown  MS  montpellier_20170112_31  montpellierLesion
sub-montpellierLesion009  M  unknown  unknown  MS  montpellier_20170112_38  montpellierLesion
sub-montpellierLesion010  F  unknown  unknown  MS  montpellier_20170112_53  montpellierLesion
sub-montpellierLesion011  F  unknown  unknown  MS  montpellier_20170112_55  montpellierLesion
sub-montpellierLesion012  M  unknown  unknown  MS  montpellier_20170112_59  montpellierLesion
sub-montpellierLesion013  F  unknown  unknown  MS  montpellier_20170112_65  montpellierLesion
sub-montpellierLesion014  M  unknown  unknown  MS  montpellier_20170112_66  montpellierLesion
```
Here is the information I added to the participants.tsv for the Rennes subjects:
```
sub-rennesMS074  unknown  unknown  MS  rennes_20170112_29  rennesMS
sub-rennesMS075  unknown  unknown  MS  rennes_20170112_17  rennesMS
sub-rennesMS076  unknown  unknown  MS  rennes_20170112_66  rennesMS
sub-rennesMS077  unknown  unknown  MS  rennes_20170112_59  rennesMS
sub-rennesMS078  unknown  unknown  MS  rennes_20170112_15  rennesMS
sub-rennesMS079  unknown  unknown  MS  rennes_20170112_13  rennesMS
sub-rennesMS080  unknown  unknown  MS  rennes_20170112_14  rennesMS
sub-rennesMS081  unknown  unknown  MS  rennes_20170112_31  rennesMS
sub-rennesMS082  unknown  unknown  MS  rennes_20170112_38  rennesMS
sub-rennesMS083  unknown  unknown  MS  rennes_20170112_07  rennesMS
sub-rennesMS084  unknown  unknown  MS  rennes_20170112_53  rennesMS
sub-rennesMS085  unknown  unknown  MS  rennes_20170112_65  rennesMS
sub-rennesMS086  unknown  unknown  MS  rennes_20170112_08  rennesMS
sub-rennesMS087  unknown  unknown  MS  rennes_20170112_55  rennesMS
```
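The `data_id` values in the two tables pair each Rennes subject with exactly one Montpellier subject via the trailing index (e.g. `29` in `montpellier_20170112_29` and `rennes_20170112_29`); a quick sanity check over the IDs listed above:

```python
# data_id values copied from the two participants.tsv tables above.
montpellier = [f"montpellier_20170112_{i}" for i in
               ("07", "08", "13", "14", "15", "17", "29", "31",
                "38", "53", "55", "59", "65", "66")]
rennes = [f"rennes_20170112_{i}" for i in
          ("29", "17", "66", "59", "15", "13", "14", "31",
           "38", "07", "53", "65", "08", "55")]

def suffix(data_id: str) -> str:
    """Trailing index of a data_id, e.g. '29' for 'rennes_20170112_29'."""
    return data_id.rsplit("_", 1)[1]

# Each trailing index appears exactly once on each side: a one-to-one pairing,
# consistent with the byte-identical files found above.
assert {suffix(s) for s in montpellier} == {suffix(s) for s in rennes}
assert len(montpellier) == len(rennes) == 14
```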
Also, the dataset description from `data-ms` doesn't give the sex of each subject: I don't know where that information comes from.
Further work: which solution are we choosing?
Thank you for working on making our database more reliable @plbenveniste 🙏
Hi @jcohenadad! I was wondering which solution (of the options listed above) we are choosing? Looking into this to close the issue.
So, if I understand the issue correctly, we don't know whether these 14 subjects are from Rennes or from Montpellier, is that correct?
Based on the dataset.pkl file the images come from Rennes. But there is no way of knowing for sure which is true...
hum... ok so let's label them as Rennes and get rid of the Montpellier ones
After more consideration, I would suggest keeping the subjects from Montpellier, i.e. not changing anything. The reason is that the Montpellier subjects have more files than the corresponding Rennes subjects:
For example, for `sub-montpellierLesion014` the files are:
For the corresponding subject `sub-rennesMS076`, the files are:
There is an additional file for the T2w image in the Montpellier folder.
Also for the derivatives:
Again, some label files are not present in the Rennes subject folder.
Thank you @jcohenadad for your feedback.
Before closing this issue, @mguaypaq could you delete the remote branches `plb/add_missing_data_ms_subjects` and `plb/add_missing_data_ms_subjects_2`?
Thanks
Done! What an adventure.
I want to use the dataset `data_ms` used to train `sct_deepseg_lesion` for training a larger model on more contrasts to segment MS lesions on the spinal cord.
The dataset can currently be found here: `duke/projects/ms_seg/seg_paper/data_ms`, as detailed in this repository.
Maybe the dataset should be renamed? What name?
To do: