lmxmercy / MGCT

Mutual-Guided Cross-Modality Transformer
5 stars 1 forks source link

MGCT

The PyTorch implementation for the paper:

MGCT: Mutual-Guided Cross-Modality Transformer for Survival Outcome Prediction using Integrative Histipathology-Genomic Features. (Accepted by IEEE BIBM 2023) arXiv

Pre-requisites

Data Preparation

Dataset Downloading

Whole Slide Images Processing

To process the WSI data, we used the CLAM WSI-analysis toolbox open-source repository. First, the tissue regions in each biopsy slide are segmented. The 256 x 256 patches without spatial overlapping are extracted from the segmented tissue regions at the desired magnification. Consequently, a pretrained truncated ResNet50 is used to encode raw image patches into 1024-dim feature vector. Using the CLAM toolbox, the features are saved as matrices of torch tensors of size N x 1024, where N is the number of patches from each WSI (varies from slide to slide). Please refer to CLAM for the details on tissue segmentation and feature extraction. In this paper, we worked on the 20X magnification of Whole Slide Images and the parameter setting can be found here. The extracted features then serve as input (in a .pt file) to the network. The following folder structure is assumed for the extracted features vectors:

DATA_ROOT_DIR/
    └──TCGA_BLCA/
        ├── slide_1.pt
        ├── slide_2.pt
        └── ...
    └──TCGA_BRCA/
        ├── slide_1.pt
        ├── slide_2.pt
        └── ...
    └──TCGA_GBMLGG/
        ├── slide_1.pt
        ├── slide_2.pt
        └── ...
    └──TCGA_LUAD/
        ├── slide_1.ptd
        ├── slide_2.pt
        └── ...
    └──TCGA_UCEC/
        ├── slide_1.pt
        ├── slide_2.pt
        └── ...
    ...

DATA_ROOT_DIR is the base directory of all datasets / cancer type (e.g. the directory to your SSD). Within DATA_ROOT_DIR, each folder contains a list of .pt files for that dataset / cancer type.

Training-Validation Splits

For evaluating the algorithm's performance, we randomly partitioned each dataset using 5-fold cross-validation which same as the PORPOISE. Splits for each cancer type are found in the splits/5foldcv folder, which each contain splits_{k}.csv for k = 1 to 5. In each splits_{k}.csv, the first column corresponds to the TCGA Case IDs used for training, and the second column corresponds to the TCGA Case IDs used for validation. Alternatively, one could define their own splits, however, the files would need to be defined in this format.

Running Experiments

python main.py --data_root_dir <PATH TO WSIs> --split_dir <CANCER TYPE SPLITS> --model_type mgct --fusion concat --mode coattn --stage1_num_layers 1 --stage2_num_layers 2