wasserth / TotalSegmentator

Tool for robust segmentation of >100 important anatomical structures in CT and MR images
Apache License 2.0

Question about data from Imaging Data Commons used for training/testing the MR model #313

Open deepakri201 opened 5 months ago

deepakri201 commented 5 months ago

Hi Jakob and co-authors,

Thank you for your contribution in creating a model to segment MR structures! My lab is very excited to try it out.

I was curious about the data that you used from Imaging Data Commons. In your supplementary material S4, you listed 21 collections from IDC that you used, and in the paper you mentioned you used data from 47 patients.

In your metadata CSV, would it be possible to include more identifying information about exactly which patients (and, if applicable, which corresponding segmentations) were used from the 21 IDC collections? Perhaps you could include the SeriesInstanceUID of each image as well as of its segmentation.

We would like to try your model on more data from IDC, and would like to make sure that it does not overlap with the data used for training.

Thank you!

Deepa

wasserth commented 5 months ago

Hi, I will try to add the UIDs for the 47 IDC images.

wasserth commented 5 months ago

After converting the DICOM files to NIfTI using dcm2niix, I got the following IDs. Does that help?

A790978_19010329074605
C3L_00598_20030324112250
C3L_00609_20081111114347
C3L_00629_20000423034146
C3L_00800_20081130125502
C3L_00817_20080516070458
C3L_01551_20090209122408
C3L_02213_20090803144956
C3L_03166_20100704120544
C3L_03196_20100607135517
C3L_03197_20100418081749
C3N_02262_20000817085034
CMB_CRC_MSB_09151_19591104154630
CMB_LCA_MSB_02291_19600703053558
CMB_MML_MSB_06305_19600919164606
TCGA_4Z_AA89_20051221174823
TCGA_B0_4849_19861206140643
TCGA_B8_5158_20031204132252
TCGA_B8_5551_20040119120454
TCGA_BP_4170_19870113103029
TCGA_BP_4349_19880711132054
TCGA_BP_4351_19900715152407
TCGA_BP_4760_19880513123045
TCGA_BP_4799_19910422082552
TCGA_BP_5010_19910728194200
TCGA_BP_5176_19880407092855
TCGA_BP_5178_19880417150021
TCGA_BP_5185_19910129151222
TCGA_BP_5201_19920106133425
TCGA_CW_5587_19910324145732
TCGA_CZ_4860_19961215094534
TCGA_CZ_5454_19971002163204
TCGA_CZ_5460_19980112073655
TCGA_CZ_5989_19971121170616
TCGA_DD_A113_19981209065923
TCGA_DD_A11D_19981018110804
TCGA_DD_A1ED_20020809101604
TCGA_DD_A1EL_20020207132615
TCGA_DD_A4NO_20000526075403
TCGA_DD_A4NP_19990121101720
TCGA_DD_A4NQ_20000914101353
TCGA_DK_A2I6_19921030130853
TCGA_DV_A4VX_19971125095930
TCGA_DV_A4VZ_19970726101349
TCGA_DW_5561_20000526134440
TCGA_DW_7839_20020304093051
TCGA_DW_7841_20030908075313

deepakri201 commented 5 months ago

Hi,

Hmm, it looks like those may be a combination of PatientID and some sort of date. I think the unique SeriesInstanceUID would be the most helpful. The image volume will have a SeriesInstanceUID, and the DICOM SEG object will also have its own SeriesInstanceUID. You could use pydicom to get these values:

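from pydicom import dcmread

# Read only the header of any one slice in the series; pixel data is not needed.
ds = dcmread("your_dcm_file.dcm", stop_before_pixels=True)
# All slices that make up one 3D volume share the same SeriesInstanceUID.
print(ds.SeriesInstanceUID)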
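
For a download that ends up as one flat directory of slices, the same call can be applied to every file and the results grouped by series. The following is a minimal sketch under stated assumptions, not the workflow used for the published dataset: the helper names `read_uids` and `group_by_series` are hypothetical, and the reader is passed in as a parameter so the grouping logic can be tried without real DICOM files.

```python
from collections import defaultdict


def read_uids(path):
    """Return the identifying header fields of one DICOM file."""
    # Imported lazily so the grouping logic below can be exercised without pydicom.
    from pydicom import dcmread

    ds = dcmread(path, stop_before_pixels=True)  # header only, skip pixel data
    return {
        "PatientID": str(ds.PatientID),
        "StudyInstanceUID": str(ds.StudyInstanceUID),
        "SeriesInstanceUID": str(ds.SeriesInstanceUID),
    }


def group_by_series(paths, reader=read_uids):
    """Map each SeriesInstanceUID to the list of files that belong to it."""
    series = defaultdict(list)
    for path in paths:
        series[reader(path)["SeriesInstanceUID"]].append(str(path))
    return dict(series)
```

With real data, something like `group_by_series(Path("download_dir").rglob("*.dcm"))` would return one entry per 3D volume; `download_dir` is a placeholder for wherever the slices were saved.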

Thanks!

deepakri201 commented 4 months ago

Hi Jakob,

I wanted to follow up on my previous comment: would it be possible to obtain the SeriesInstanceUIDs?

Also, it would be great to match the SeriesInstanceUIDs to the IDs that you used (s0001, s0002, etc.).

Thank you!

Deepa

fedorov commented 4 months ago

@wasserth I think it is quite important to know precisely what data was used to train the model. IDC makes it very easy to retrieve the images identified by DICOM UIDs. All you have to do is provide the list of those UIDs and the IDC data release version. It would be great if we could work together to gather and share this information and, by doing so, improve the transparency of your training process.
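
To make the request concrete, here is a rough sketch of what such a list could look like and how it could be used to check new data for overlap with the training set. The column names, the function names, the UIDs (`1.2.3.x`) and the release label (`vXX`) are all placeholders chosen for illustration; none of them come from the actual dataset.

```python
import csv
import io


def write_manifest(rows, idc_version, out):
    """Write one row per training case: case ID, SeriesInstanceUID, IDC release."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["case_id", "SeriesInstanceUID", "idc_version"])
    for case_id, series_uid in rows:
        writer.writerow([case_id, series_uid, idc_version])


def not_in_training(candidate_uids, manifest_file):
    """Return the candidate SeriesInstanceUIDs that do not appear in the manifest."""
    used = {row["SeriesInstanceUID"] for row in csv.DictReader(manifest_file)}
    return [uid for uid in candidate_uids if uid not in used]


# Placeholder values only: real UIDs and the real release version are not known here.
manifest = io.StringIO()
write_manifest([("s0001", "1.2.3.1"), ("s0002", "1.2.3.2")], "vXX", manifest)
manifest.seek(0)
print(not_in_training(["1.2.3.2", "1.2.3.9"], manifest))
```

Keeping the case ID (s0001, s0002, ...) next to each UID would also answer the mapping question raised earlier in this thread.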

wasserth commented 4 months ago

When I downloaded all the files, I got one directory with thousands of DICOM slices in it. It was not easy to see which ones make up one 3D volume. I ran dcm2niix on the directory and luckily it sorted out all the files and generated a list of 3D NIfTI files. From these files I selected a subset of images. They have the names which I showed earlier. I do not really know how to go back from these names to the DICOM UIDs. For each file, dcm2niix also generated a JSON file like the following, but it does not seem to contain the DICOM UID either:

{
    "Modality": "MR",
    "MagneticFieldStrength": 1.5,
    "ImagingFrequency": 63.8695,
    "Manufacturer": "GE",
    "ManufacturersModelName": "Signa HDxt",
    "DeviceSerialNumber": "000000000000GEHC",
    "BodyPartExamined": "BLADDER",
    "PatientPosition": "FFS",
    "ProcedureStepDescription": "RESSONANCIA MAGNETICA DE ABDOME INFERIOR",
    "SoftwareVersions": "15\\LX\\MR Software release:15.0_M4A_0947.a",
    "MRAcquisitionType": "2D",
    "SeriesDescription": "AX T2 FRFSE",
    "ProtocolName": "AX T2 FRFSE",
    "ScanningSequence": "SE",
    "SequenceVariant": "SK\\OSP",
    "ScanOptions": "FAST_GEMS\\NPW\\TRF_GEMS\\FILTERED_GEMS",
    "ImageType": ["ORIGINAL", "PRIMARY", "OTHER"],
    "SeriesNumber": 3,
    "AcquisitionTime": "17:50:41.000000",
    "AcquisitionNumber": 1,
    "ConvolutionKernel": "STANDARD",
    "Unit": "BQML",
    "DecayCorrection": "START",
    "AttenuationCorrectionMethod": "measured,, 0.096000 cm-1,",
    "ReconstructionMethod": "OSEM",
    "FrameTimesStart": [
        0
    ],
    "SliceThickness": 5,
    "SpacingBetweenSlices": 6,
    "SAR": 1.6359,
    "EchoTime": 0.118044,
    "RepetitionTime": 7.26667,
    "RepetitionTimeExcitation": 0.0053688,
    "FlipAngle": 90,
    "CoilString": "8Ch Body Lower",
    "PercentPhaseFOV": 100,
    "PercentSampling": 100,
    "EchoTrainLength": 26,
    "AcquisitionMatrixPE": 192,
    "ReconMatrixPE": 512,
    "PixelBandwidth": 162.773,
    "PhaseEncodingAxis": "i",
    "ImageOrientationPatientDICOM": [
        0.999951,
        0.00979658,
        0.00161749,
        -0.00979642,
        0.999935,
        0.00589007
    ],
    "InPlanePhaseEncodingDirectionDICOM": "ROW",
    "ConversionSoftware": "dcm2niix",
    "ConversionSoftwareVersion": "v1.0.20201102"
}
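
Although the sidecar has no UID, it does carry fields that also exist in the DICOM header, such as SeriesNumber and SeriesDescription. One possible way to narrow a NIfTI file down to a series, once the study is known, is to compare those fields. This is only a sketch of that idea and an assumption on my part, not a method confirmed in this thread; `sidecar_key` and `matching_series` are hypothetical helper names, and the UIDs in the usage example are placeholders.

```python
import json


def sidecar_key(sidecar_text):
    """Extract (SeriesNumber, SeriesDescription) from a dcm2niix JSON sidecar."""
    meta = json.loads(sidecar_text)
    return (int(meta["SeriesNumber"]), meta["SeriesDescription"].strip())


def matching_series(sidecar_text, study_series):
    """Return SeriesInstanceUIDs in one study whose header fields match the sidecar.

    study_series: iterable of dicts with SeriesNumber, SeriesDescription and
    SeriesInstanceUID, for example collected from the DICOM headers of that study.
    """
    key = sidecar_key(sidecar_text)
    return [
        s["SeriesInstanceUID"]
        for s in study_series
        if (int(s["SeriesNumber"]), str(s["SeriesDescription"]).strip()) == key
    ]


# Placeholder study contents; the UIDs are dummies.
sidecar = '{"SeriesNumber": 3, "SeriesDescription": "AX T2 FRFSE"}'
study = [
    {"SeriesNumber": 2, "SeriesDescription": "COR T2", "SeriesInstanceUID": "1.2.3.2"},
    {"SeriesNumber": 3, "SeriesDescription": "AX T2 FRFSE", "SeriesInstanceUID": "1.2.3.3"},
]
print(matching_series(sidecar, study))
```

If more than one series in a study shares the same number and description, the result has several entries and further fields would be needed to disambiguate.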
fedorov commented 4 months ago

@wasserth thank you for the explanation. This definitely helps us understand why matching UIDs is not trivial with your processing approach.

We have an idea for how to match them; we will explore it and update this issue.

We will also improve the download process from the IDC Portal so that you do not have to deal with one directory with thousands of DICOM slices in it.

vkt1414 commented 3 months ago

@wasserth, do you mind sharing the dcm2niix command you used?

Did you only use the compression argument, as in -z y? Thank you!

fedorov commented 3 months ago

I don't know why I didn't think about this earlier, but I think I figured it out. It appears that the strings shared in https://github.com/wasserth/TotalSegmentator/issues/313#issuecomment-2147306072 are formed as a concatenation of PatientID, StudyDate and StudyTime, except that dashes in PatientID are replaced with underscores (I don't know why).

@vkt1414 here's the query

WITH
  selected AS (
  SELECT
    REPLACE(CONCAT(PatientID, '_', CAST(StudyDate AS STRING FORMAT 'YYYYMMDD'), CAST(StudyTime AS STRING FORMAT 'HHMISS')), '-', '_') AS filename
  FROM
    `bigquery-public-data.idc_current.dicom_all`)
SELECT
  DISTINCT(filename)
FROM
  selected
WHERE
  # exact name CMB_CRC_MSB_09151_19591104154630
  filename LIKE "CMB_CRC_MSB_09151%"
ORDER BY
  filename

The only difference appears to be a shift in the HH (hour) part of the StudyTime (maybe because of a locale or time zone setting, or something like that; I don't know). For the query above, the result is this:

[image: query result]
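
The naming rule described above can be sketched in a few lines of Python. One possible explanation for the hour shift, offered here as a hypothesis and not checked against the query output in this thread, is that in BigQuery format strings `HH` is the 12-hour clock element while `HH24` is the 24-hour one, so afternoon times would come out 12 hours lower. The function name and the `twelve_hour` switch are hypothetical, and the raw PatientID, StudyDate and StudyTime values in the usage lines are inferred from one of the IDs listed earlier, not read from the DICOM headers.

```python
def dcm2niix_style_id(patient_id, study_date, study_time, twelve_hour=False):
    """Build PatientID_YYYYMMDDHHMMSS with dashes in PatientID replaced by underscores.

    study_date is a DICOM DA string (YYYYMMDD); study_time is a DICOM TM string
    (HHMMSS, optionally followed by a fractional part).
    """
    hhmmss = study_time.split(".")[0].ljust(6, "0")[:6]
    hour = int(hhmmss[:2])
    if twelve_hour:
        hour = hour % 12 or 12  # 12-hour clock: 0 -> 12, 15 -> 3
    return f"{patient_id.replace('-', '_')}_{study_date}{hour:02d}{hhmmss[2:]}"


# 24-hour form, as in the list of IDs shared earlier in this thread.
print(dcm2niix_style_id("TCGA-BP-4349", "19880711", "132054.000000"))
# 12-hour form, which is what a 12-hour format element would produce.
print(dcm2niix_style_id("TCGA-BP-4349", "19880711", "132054.000000", twelve_hour=True))
```

If the hypothesis holds, comparing against both forms (or switching the query to a 24-hour element) should make the generated strings line up with the shared IDs.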

fedorov commented 3 months ago

After converting the Dicoms to niftis using dcm2niix I got the following IDs. A790978_19010329074605 C3L_00598_20030324112250 C3L_00609_20081111114347 C3L_00629_20000423034146

@wasserth can you provide the mapping from the s0001 etc case IDs in your Zenodo entry to the IDs of the kind you listed above?

deepakri201 commented 2 months ago

Hi @wasserth,

We wanted to know if it would be possible to address the above issue. Thank you!

Deepa