nyukat / GMIC

An interpretable classifier for high-resolution breast cancer screening images utilizing weakly supervised localization
https://doi.org/10.1016/j.media.2020.101908
GNU Affero General Public License v3.0

Introduction

This is an implementation of the Globally-Aware Multiple Instance Classifier (GMIC) model as described in our paper. The architecture of the proposed model is shown below.

Highlights of GMIC:

The implementation allows users to obtain breast cancer predictions and visualization of saliency maps by applying one of our pretrained models. We provide weights for 5 GMIC-ResNet-18 models. The model is implemented in PyTorch.

[Figure: architecture of the GMIC model]

Update (2021/03/08): Updated the documentation

Update (2020/12/15): Added the preprocessing pipeline.

Update (2020/12/16): Added the example notebook.

Prerequisites

License

This repository is licensed under the terms of the GNU AGPLv3 license.

How to run the code

First install conda in your environment. Before running the code, install the dependencies with pip install -r requirements.txt. Once all dependencies are installed, run.sh runs the entire pipeline and saves the prediction results as a csv file. Note that you must first cd to the project directory and then execute . ./run.sh. When running the individual Python scripts, include the path to this repository in your PYTHONPATH.

We recommend running the code with a GPU. To run the code with CPU only, please change DEVICE_TYPE in run.sh to 'cpu'.

The following variables defined in run.sh can be modified as needed:

You should obtain the following outputs for the sample exams provided in the repository (found in sample_output/predictions.csv by default).

image_index  benign_pred  malignant_pred  benign_label  malignant_label
0_L-CC       0.1356       0.0081          0             0
0_R-CC       0.8929       0.3259          1             0
0_L-MLO      0.2368       0.0335          0             0
0_R-MLO      0.9509       0.1812          1             0
1_L-CC       0.0546       0.0168          0             0
1_R-CC       0.5986       0.9910          0             1
1_L-MLO      0.0414       0.0139          0             0
1_R-MLO      0.5383       0.9308          0             1
2_L-CC       0.0678       0.0227          0             0
2_R-CC       0.1917       0.0603          1             0
2_L-MLO      0.1210       0.0093          0             0
2_R-MLO      0.2440       0.0231          1             0
3_L-CC       0.6295       0.9326          0             1
3_R-CC       0.2291       0.1603          0             0
3_L-MLO      0.6304       0.7496          0             1
3_R-MLO      0.0622       0.0507          0             0
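
As a small illustration of how these per-image predictions might be consumed, the sketch below averages malignant_pred over the CC and MLO views of each breast. The averaging rule is only an illustrative choice, not necessarily how the paper aggregates views, and the comma-separated layout and the helper name breast_level_scores are assumptions. The rows are copied from the output above.

```python
import csv
import io
from collections import defaultdict

# Rows for exam 1, copied from the expected output above. The comma-separated
# layout is an assumption about how predictions.csv is written.
SAMPLE_CSV = """image_index,benign_pred,malignant_pred,benign_label,malignant_label
1_L-CC,0.0546,0.0168,0,0
1_R-CC,0.5986,0.9910,0,1
1_L-MLO,0.0414,0.0139,0,0
1_R-MLO,0.5383,0.9308,0,1
"""


def breast_level_scores(csv_text):
    """Average malignant_pred over the views of each (exam, side) pair.

    Hypothetical helper: simple averaging is used here for illustration only.
    """
    per_breast = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # image_index looks like "<exam>_<side>-<view>", e.g. "1_R-CC".
        exam, view = row["image_index"].split("_", 1)
        side = view.split("-")[0]  # "L" or "R"
        per_breast[(exam, side)].append(float(row["malignant_pred"]))
    return {key: sum(vals) / len(vals) for key, vals in per_breast.items()}


scores = breast_level_scores(SAMPLE_CSV)
for (exam, side), score in sorted(scores.items()):
    print(f"exam {exam}, {side} breast: mean malignant_pred = {score:.4f}")
```

To apply the same idea to a real run, the text of sample_output/predictions.csv could be passed in place of SAMPLE_CSV, provided the file uses the column names shown above.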

Data

sample_data/images contains 4 exams, each of which includes the 4 original mammography images (L-CC, L-MLO, R-CC, R-MLO). All mammography images are saved in png format. The original 12-bit mammograms are saved as rescaled 16-bit images to preserve the granularity of the pixel intensities while still displaying correctly in image viewers.
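
The exact rescaling used to produce the sample images is not stated here. As a sketch of one plausible approach, the code below maps 12-bit intensities linearly onto the 16-bit range and shows that the original values can be recovered exactly. The function names are hypothetical and the linear mapping is an assumption.

```python
MAX_12BIT = 4095   # 2**12 - 1
MAX_16BIT = 65535  # 2**16 - 1


def rescale_12bit_to_16bit(value):
    """Map an intensity in [0, 4095] linearly onto [0, 65535].

    Assumed mapping for illustration; the repository may scale differently.
    """
    if not 0 <= value <= MAX_12BIT:
        raise ValueError("expected a 12-bit intensity")
    return value * MAX_16BIT // MAX_12BIT


def restore_12bit(value16):
    """Invert rescale_12bit_to_16bit by rounding to the nearest 12-bit level."""
    return (value16 * MAX_12BIT + MAX_16BIT // 2) // MAX_16BIT


# Every 12-bit level survives the round trip, so no granularity is lost.
round_trip_ok = all(
    restore_12bit(rescale_12bit_to_16bit(v)) == v for v in range(MAX_12BIT + 1)
)
print("lossless round trip:", round_trip_ok)
```

Because the full 16-bit range is used, a standard image viewer shows the rescaled image at normal brightness, whereas raw 12-bit values stored in a 16-bit file would appear very dark.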

sample_data/segmentation contains the binary pixel-level segmentation labels for some exams. All segmentations are saved as png images.

sample_data/exam_list_before_cropping.pkl contains a list of exam information. Each exam is represented as a dictionary with the following format:

{'horizontal_flip': 'NO',
  'L-CC': ['0_L-CC'],
  'L-MLO': ['0_L-MLO'],
  'R-MLO': ['0_R-MLO'],
  'R-CC': ['0_R-CC'],
  'best_center': {'R-CC': [(1136.0, 158.0)],
   'R-MLO': [(1539.0, 252.0)],
   'L-MLO': [(1530.0, 307.0)],
   'L-CC': [(1156.0, 262.0)]},
  'cancer_label': {'benign': 1,
   'right_benign': 0,
   'malignant': 0,
   'left_benign': 1,
   'unknown': 0,
   'right_malignant': 0,
   'left_malignant': 0},
  'L-CC_benign_seg': ['0_L-CC_benign'],
  'L-CC_malignant_seg': ['0_L-CC_malignant'],
  'L-MLO_benign_seg': ['0_L-MLO_benign'],
  'L-MLO_malignant_seg': ['0_L-MLO_malignant'],
  'R-MLO_benign_seg': ['0_R-MLO_benign'],
  'R-MLO_malignant_seg': ['0_R-MLO_malignant'],
  'R-CC_benign_seg': ['0_R-CC_benign'],
  'R-CC_malignant_seg': ['0_R-CC_malignant']}

In their original formats, images from the L-CC and L-MLO views face right, and images from the R-CC and R-MLO views face left. We horizontally flipped the R-CC and R-MLO images so that all four views face right. The values for L-CC, R-CC, L-MLO, and R-MLO are lists of image filenames without extensions or directory names.
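
The following sketch shows how an exam list in this format could be read and turned into image paths. It writes a trimmed stand-in for one exam to a temporary pickle file so that it runs on its own. The helper name image_paths and the assumption that images live directly under sample_data/images with a .png extension are illustrative, not taken from the repository code.

```python
import pickle
import tempfile
from pathlib import Path

VIEWS = ("L-CC", "R-CC", "L-MLO", "R-MLO")

# Trimmed stand-in for one entry of the exam list shown above.
exam_list = [{
    "horizontal_flip": "NO",
    "L-CC": ["0_L-CC"],
    "L-MLO": ["0_L-MLO"],
    "R-MLO": ["0_R-MLO"],
    "R-CC": ["0_R-CC"],
}]


def image_paths(exams, image_dir, extension=".png"):
    """Build file paths from the extension-less filenames stored per view.

    Hypothetical helper; the directory layout and extension are assumptions.
    """
    paths = []
    for exam in exams:
        for view in VIEWS:
            for short_name in exam[view]:
                paths.append(Path(image_dir) / f"{short_name}{extension}")
    return paths


# Round-trip through a pickle file, as one would with the real exam list.
with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "exam_list.pkl"
    with open(pkl_path, "wb") as f:
        pickle.dump(exam_list, f)
    with open(pkl_path, "rb") as f:
        loaded = pickle.load(f)

paths = image_paths(loaded, "sample_data/images")
for path in paths:
    print(path)
```

With the real data, the temporary file would be replaced by sample_data/exam_list_before_cropping.pkl. Only unpickle files from sources you trust, since pickle can execute arbitrary code on load.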

Preprocessing

Run the following commands to crop mammograms and calculate information about augmentation windows.

Crop mammograms

python3 src/cropping/crop_mammogram.py \
    --input-data-folder $DATA_FOLDER \
    --output-data-folder $CROPPED_IMAGE_PATH \
    --exam-list-path $INITIAL_EXAM_LIST_PATH  \
    --cropped-exam-list-path $CROPPED_EXAM_LIST_PATH  \
    --num-processes $NUM_PROCESSES

src/cropping/crop_mammogram.py crops each mammogram around the breast and discards the background, which reduces image loading time and the time needed to run the segmentation algorithm. It saves each cropped image to $CROPPED_IMAGE_PATH/short_file_path.png using h5py. It also adds information for each image and writes a new exam list to $CROPPED_EXAM_LIST_PATH, discarding images that it fails to crop. The optional --verbose argument prints information about each image. The additional information includes the following:
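
To illustrate the general idea of cropping around the breast and discarding background, here is a deliberately simplified sketch that keeps only the bounding box of pixels above a threshold. It is not the algorithm in crop_mammogram.py, which is more involved. The function name crop_to_foreground, the threshold, and the toy image are all assumptions.

```python
def crop_to_foreground(image, threshold=0):
    """Crop a 2D list of intensities to the bounding box of its foreground.

    Returns the cropped image and the window (top, bottom, left, right),
    where bottom and right are exclusive. Simplified illustration only.
    """
    rows = [r for r, row in enumerate(image) if any(p > threshold for p in row)]
    cols = [
        c for c in range(len(image[0]))
        if any(row[c] > threshold for row in image)
    ]
    if not rows or not cols:
        raise ValueError("no foreground pixels found")
    top, bottom = rows[0], rows[-1] + 1
    left, right = cols[0], cols[-1] + 1
    cropped = [row[left:right] for row in image[top:bottom]]
    return cropped, (top, bottom, left, right)


# Toy image: zeros are background, non-zero values stand in for tissue.
image = [
    [0, 0, 0, 0, 0],
    [0, 7, 9, 0, 0],
    [0, 5, 8, 3, 0],
    [0, 0, 0, 0, 0],
]
cropped, window = crop_to_foreground(image)
print(cropped)
print(window)
```

Recording the crop window alongside the cropped image makes it possible to map coordinates in the cropped image back to the original image.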

Calculate optimal centers

python3 src/optimal_centers/get_optimal_centers.py \
    --cropped-exam-list-path $CROPPED_EXAM_LIST_PATH \
    --data-prefix $CROPPED_IMAGE_PATH \
    --output-exam-list-path $EXAM_LIST_PATH \
    --num-processes $NUM_PROCESSES

src/optimal_centers/get_optimal_centers.py outputs a new exam list with additional metadata to $EXAM_LIST_PATH. The additional information includes the following:

Outcomes of preprocessing

After the preprocessing step, you should have the following files in the $OUTPUT_PATH directory (default is sample_output):

Reference

If you found this code useful, please cite our paper:

An interpretable classifier for high-resolution breast cancer screening images utilizing weakly supervised localization
Yiqiu Shen, Nan Wu, Jason Phang, Jungkyu Park, Kangning Liu, Sudarshini Tyagi, Laura Heacock, S. Gene Kim, Linda Moy, Kyunghyun Cho and Krzysztof J. Geras
Medical Image Analysis 2020

@article{shen2020interpretable,
    title={An interpretable classifier for high-resolution breast cancer screening images utilizing weakly supervised localization},
    author={Shen, Yiqiu and Wu, Nan and Phang, Jason and Park, Jungkyu and Liu, Kangning and Tyagi, Sudarshini and Heacock, Laura and Kim, S Gene and Moy, Linda and Cho, Kyunghyun and others},
    journal={Medical Image Analysis},
    pages={101908},
    year={2020},
    publisher={Elsevier}
}

Reference to previous GMIC version:

Globally-Aware Multiple Instance Classifier for Breast Cancer Screening
Yiqiu Shen, Nan Wu, Jason Phang, Jungkyu Park, S. Gene Kim, Linda Moy, Kyunghyun Cho and Krzysztof J. Geras
Machine Learning in Medical Imaging - 10th International Workshop, MLMI 2019, Held in Conjunction with MICCAI 2019, Proceedings. Springer, 2019. p. 18-26 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 11861 LNCS).

@inproceedings{shen2019globally,
    title={Globally-Aware Multiple Instance Classifier for Breast Cancer Screening},
    author={Shen, Yiqiu and Wu, Nan and Phang, Jason and Park, Jungkyu and Kim, Gene and Moy, Linda and Cho, Kyunghyun and Geras, Krzysztof J},
    booktitle={Machine Learning in Medical Imaging: 10th International Workshop, MLMI 2019, Held in Conjunction with MICCAI 2019, Shenzhen, China, October 13, 2019, Proceedings},
    volume={11861},
    pages={18-26},
    year={2019},
    organization={Springer Nature}
}