nyukat / mammography_metarepository

Meta-repository of screening mammography classifiers
https://arxiv.org/abs/2108.04800
BSD 2-Clause "Simplified" License

CMMD dataset - scan/exams dates #18

Closed by gonzaq94 2 years ago

gonzaq94 commented 2 years ago

Hi,

I am working with the CMMD dataset and found an inconsistency between the exam dates stored in the scans and the reported exam period. On the official CMMD dataset page (https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=70230508) and in the paper that you published, the exams are said to have been conducted between 2012 and 2016. However, when I downloaded the images with the NBIA Data Retriever, I found only two possible exam dates: 18/07/2021 and 18/07/2011 (these are the dates that can be seen in each image and in the folders that contain them). I suspect that there is a problem with these exam dates. Do you know whether it is possible to obtain the real date of each exam? How did you manage patients that have both benign & malignant exam results?

Thanks in advance for your help!

Gonzalo

jwitos commented 2 years ago

Hi @gonzaq94. The information about exams being acquired between 2012 and 2016 is taken from the description on the official TCIA page that you listed. You are correct that the DICOM files contain only two possible dates. It seems that all DICOMs have been anonymized with the Basic Application Level Confidentiality Profile, according to the DICOM Standard PS 3.15 Annex E. This includes the "Retain Longitudinal Temporal Information Modified Dates Option", as described in Section E.3.6, under which the original dates are replaced with modified ones.

In summary, it is not possible to obtain the original exam dates from those DICOM files. I would suggest reaching out to the authors of the dataset.

> How did you manage patients that have both benign & malignant exam results?

In our classifiers we only care about malignant versus non-malignant discrimination. If a case in the CMMD dataset is labeled both benign and malignant, we treat it as malignant. Let me know if that makes sense.
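For illustration, the rule above can be sketched as follows. This is a minimal example, not code from the repository: the function name `to_binary_label` is hypothetical, and the label strings are assumptions that may not match the spelling used in the CMMD clinical data file.

```python
def to_binary_label(labels):
    """Collapse the labels attached to one case into malignant (1) vs non-malignant (0).

    `labels` is an iterable of strings such as "Benign" or "Malignant"
    (assumed spellings). Any malignant entry makes the whole case malignant.
    """
    normalized = {label.strip().lower() for label in labels}
    return 1 if "malignant" in normalized else 0


print(to_binary_label(["Benign"]))               # 0
print(to_binary_label(["Benign", "Malignant"]))  # 1
```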

gonzaq94 commented 2 years ago

Hi @jwitos, thank you very much for your clear answer. You are right: I checked the images and they have the Longitudinal Temporal Information Modified attribute set to MODIFIED, which means that the dates have been modified (as described in Section E.3.6 of the DICOM standard).
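A check like this one could be scripted roughly as below. It is only a sketch under some assumptions: the helper names `summarize_headers` and `load_headers` are made up for this example, and reading the files relies on the third-party `pydicom` package, whose `dcmread` function is used with `stop_before_pixels=True` to skip the image data.

```python
from collections import Counter


def summarize_headers(headers):
    """Count the (StudyDate, LongitudinalTemporalInformationModified) pairs seen.

    `headers` is an iterable of objects exposing DICOM keywords as attributes,
    for example pydicom datasets. Missing attributes are reported as None.
    """
    counts = Counter()
    for header in headers:
        study_date = getattr(header, "StudyDate", None)
        modified = getattr(header, "LongitudinalTemporalInformationModified", None)
        counts[(study_date, modified)] += 1
    return counts


def load_headers(paths):
    """Yield DICOM headers without pixel data (requires the pydicom package)."""
    import pydicom  # imported lazily so summarize_headers works without it

    for path in paths:
        yield pydicom.dcmread(path, stop_before_pixels=True)
```

Calling `summarize_headers(load_headers(paths))` on a list of downloaded `.dcm` paths should then show how many distinct study dates exist and whether each file is flagged as MODIFIED.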

I have another question regarding this dataset. The dataset is divided into subsets D1 and D2, where the D2 subset also contains the lesion subtype. My question is: are these two subsets disjoint with respect to the patients that they contain? For instance, the patients D1-0028 and D2-0028 are in reality two different patients, or they are just the same patient for which we have two different exams or scans? This information is relevant if we want to make a train-test-validation split between the clients, to avoid including the same patient in different splits.
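To make the concern concrete, here is a rough sketch of a split that keeps all exams of a patient together. The function `split_by_patient`, the split fractions, and the exam ID format are assumptions for illustration only. The `patient_of` argument decides what counts as one patient, so it can encode either answer to the question above.

```python
import random
from collections import defaultdict


def split_by_patient(exam_ids, patient_of, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign every exam of a patient to the same split.

    exam_ids   -- iterable of exam identifiers
    patient_of -- function mapping an exam identifier to a patient identifier
    fractions  -- (train, validation, test) shares of patients
    """
    exams_by_patient = defaultdict(list)
    for exam_id in exam_ids:
        exams_by_patient[patient_of(exam_id)].append(exam_id)

    # Shuffle patients, not exams, so a patient never straddles two splits.
    patients = sorted(exams_by_patient)
    random.Random(seed).shuffle(patients)

    n_train = round(fractions[0] * len(patients))
    n_val = round(fractions[1] * len(patients))
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    return {name: [e for p in members for e in exams_by_patient[p]]
            for name, members in groups.items()}
```

Until the relation between D1 and D2 is known, a conservative choice would be `patient_of=lambda e: e.split("-")[1]`, which treats D1-0028 and D2-0028 as the same patient and therefore always places them in the same split.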

Thanks again!

Gonzalo

jwitos commented 2 years ago

@gonzaq94 Hi Gonzalo, this is an interesting question, but I'm afraid it is outside the scope of my knowledge and of this repository. I'd suggest you reach out to the CMMD authors/maintainers or to TCIA. I'd be interested in hearing the answer too if you learn more.

rickymwalsh commented 2 years ago

> For instance, the patients D1-0028 and D2-0028 are in reality two different patients, or they are just the same patient for which we have two different exams or scans?

@gonzaq94 Not sure if you've already found the answer to this, but looking at the clinical data file it seems that the IDs are not related, as D1-0028 has age 35 and D2-0028 has age 44. This seems to be the case for the other IDs too. Furthermore, there are 1,026 unique D1 IDs and 749 unique D2 IDs, so to reach the total of 1,775 subjects claimed in the dataset description (1,026 + 749 = 1,775), they would have to be separate sets.
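For anyone wanting to repeat this check, a minimal sketch is given below. It assumes the clinical data has been loaded as a list of dicts (for example with `csv.DictReader` after exporting the file to CSV). The function name `compare_subsets` and the column names `ID` and `Age` are placeholders, so adjust them to the actual headers.

```python
from collections import defaultdict


def compare_subsets(rows, id_key="ID", age_key="Age"):
    """Count unique IDs per subset prefix and list same-numbered IDs whose ages differ.

    rows -- iterable of dicts with an ID such as "D1-0028" and an age.
    """
    ages = defaultdict(dict)  # prefix -> {number: set of ages}
    for row in rows:
        prefix, number = row[id_key].split("-", 1)
        ages[prefix].setdefault(number, set()).add(int(row[age_key]))

    unique_counts = {prefix: len(numbers) for prefix, numbers in ages.items()}
    # Numbers that appear in both subsets but never with a common age.
    shared = set(ages.get("D1", {})) & set(ages.get("D2", {}))
    age_mismatch = sorted(n for n in shared if ages["D1"][n].isdisjoint(ages["D2"][n]))
    return unique_counts, age_mismatch


rows = [{"ID": "D1-0028", "Age": "35"}, {"ID": "D2-0028", "Age": "44"}]
counts, mismatch = compare_subsets(rows)
print(counts, mismatch)
```

A differing age is only indirect evidence that two IDs belong to different people, so the unique ID counts adding up to the documented subject total remain the stronger argument.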