reginabarzilaygroup / Sybil

Deep Learning for Lung Cancer Risk Prediction using LDCT
MIT License

Metadata format for dataloader #25

Closed sanan222 closed 9 months ago

sanan222 commented 9 months ago

Hello,

I intend to use a JSON file to load my data. In the loading code there is little information about the expected sample_metadata format. I looked through nlst.py under the loading folder, but if you could share a sample_metadata.json file with some random numbers showing the format, I would appreciate it.

Best Regards

pgmikhael commented 9 months ago

Hi,

Thanks for reaching out!

The file should be a list of dictionaries, one per patient. Each dictionary is organized hierarchically:

patient metadata > list of exams > dictionary per series > series information

More specifically, here is what it looks like. I've also included an example JSON with all the relevant dictionary keys and random/empty values: nlst_sample.json

{
  "pid": "XYZ",                                                                    # PATIENT ID
  "split": "test",                                                                 # SPLIT
  "accessions": [                                                                  # LIST of EXAMS
    {                                                                                  # DICT for EXAM 1
      "exam": "exam_id_timepoint",                                                     # EXAM ID + TIMEPOINT
      "accession_number": "exam_id",                                                   # EXAM ID
      "screen_timepoint": "timepoint",                                                 # TIMEPOINT
      "date": "YYYYMMDD",                                                              # EXAM DATE
      "image_series": {                                                                # DICT of SERIES
        "series_id1": {                                                                    # DICT for SERIES 1
          "paths": ["/path/to/slice1.png", "/path/to/slice2.png", "/path/to/slice3.png"],      # LIST of PATHS to DICOMs/PNGs
          "slice_location": [3,1,2],                                                           # SLICE LOCATIONS from DICOM METADATA
          "slice_number": [3,1,2],                                                             # SLICE NUMBERS from DICOM METADATA
          "img_position": [3,1,2],                                                             # IMAGE POSITION from DICOM METADATA
          "pixel_spacing": [0.703125, 0.703125],                                               # PIXEL SPACING from DICOM METADATA
          "slice_thickness": 2.5,                                                              # SLICE THICKNESS from DICOM METADATA
          "series_data": {                                                                     # DICT of SERIES METADATA
            "reconfilter": ["STANDARD"],                                                           # RECONSTRUCTION FILTER from NLST 
            "reconthickness": [2.5],                                                               # RECONSTRUCTION THICKNESS from NLST 
            "manufacturer": [1],                                                                   # MANUFACTURER from NLST 
            ...                                                                                    # OTHER METADATA from NLST
          },
        },
        "series_id2": {}                                                                   # DICT for SERIES 2
      },
      "abnormalities": {                                                               # DICT of ABNORMALITIES from NLST                                                   
        "sct_ab_desc": [51],                                                               # SCT ABNORMALITY DESCRIPTION from NLST 
        "sct_ab_num": [1],                                                                 # SCT ABNORMALITY NUMBER from NLST 
        ...                                                                                # OTHER ABNORMALITY DATA
      },
    },
    {},                                                                                # DICT for EXAM 2
    {}                                                                                 # DICT for EXAM 3
  ],
  "pt_metadata": {                                                                 # DICT of PATIENT METADATA from NLST
    "race": [X],                                                                       # RACE 
    "cigsmok": [0],                                                                    # CIGARETTE SMOKING  
    "candx_days": [45],                                                                # DAYS TO CANCER DIAGNOSIS                                                                                                                                        
    ...
  }
}
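
For illustration only, here is a minimal sketch (not part of the Sybil codebase) that assembles one patient record in this format and writes it out as a list of patient dictionaries. All values, IDs, and paths below are placeholders; real NLST values will differ.

    import json

    patient = {
        "pid": "100001",                      # patient ID (placeholder)
        "split": "test",                      # train / dev / test
        "accessions": [
            {
                "exam": "100001_T0",          # exam ID + timepoint
                "accession_number": "100001",
                "screen_timepoint": "T0",
                "date": "19990101",           # YYYYMMDD
                "image_series": {
                    "series_id1": {           # series ID (placeholder)
                        "paths": [
                            "/data/100001/T0/slice1.png",
                            "/data/100001/T0/slice2.png",
                        ],
                        "slice_location": [1, 2],
                        "slice_number": [1, 2],
                        "img_position": [1, 2],
                        "pixel_spacing": [0.703125, 0.703125],
                        "slice_thickness": 2.5,
                        "series_data": {
                            "reconfilter": ["STANDARD"],
                            "reconthickness": [2.5],
                            "manufacturer": [1],
                        },
                    }
                },
                "abnormalities": {"sct_ab_desc": [51], "sct_ab_num": [1]},
            }
        ],
        "pt_metadata": {"race": [1], "cigsmok": [0], "candx_days": [45]},
    }

    # The metadata file is a list of such patient dictionaries.
    with open("sample_metadata.json", "w") as f:
        json.dump([patient], f, indent=2)
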
sanan222 commented 9 months ago

Thank you very much for the detailed explanation. That will work for us. Furthermore, I would like to ask about the file and folder layout of your NLST dataset. If you could share that too, I would highly appreciate it.

pgmikhael commented 9 months ago

Hi,

I'm not sure what you mean by file and folder order.

sanan222 commented 9 months ago

Let me clarify. I want to run the train.py code, and for that I need to load the dataset. NLST datasets typically have a fairly complicated folder layout. Below is one dataset folder structure that I found on the Internet.

[image: example NLST dataset folder structure]

Here I think the first folder level is the PID, the second is the exam ID, and the last one contains the CT scan slices. I wonder what these folders look like in your NLST dataset.

pgmikhael commented 9 months ago

Hi,

It follows a similar structure, but the directory structure shouldn't matter if the JSON is configured as above. What matters is that every series has the list of paths to the PNG/DICOM images, and those are then loaded during training. In the most simplified setting, a sample in a training batch just requires the image paths and the label.
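
For illustration only, a minimal sketch (not the actual Sybil dataloader) of such a simplified sample, assuming the JSON format above: each series contributes its list of image paths plus a label. The label rule below is a placeholder, and real preprocessing (sorting by slice location, resampling, normalization) is omitted.

    import json
    import numpy as np
    from PIL import Image
    from torch.utils.data import Dataset

    class MinimalCTSeriesDataset(Dataset):
        def __init__(self, metadata_path):
            with open(metadata_path) as f:
                patients = json.load(f)
            # Flatten the patient > exam > series hierarchy into (paths, label) samples.
            self.samples = []
            for pt in patients:
                # Placeholder label rule: positive if a cancer diagnosis date exists.
                label = int(pt["pt_metadata"].get("candx_days", [-1])[0] >= 0)
                for exam in pt["accessions"]:
                    for series in exam["image_series"].values():
                        self.samples.append((series["paths"], label))

        def __len__(self):
            return len(self.samples)

        def __getitem__(self, idx):
            paths, label = self.samples[idx]
            # Stack the slices into a (num_slices, H, W) volume.
            volume = np.stack([np.array(Image.open(p)) for p in paths])
            return volume, label
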

sanan222 commented 8 months ago

Thank you for your help, Peter! This info will work for me.