peng-lab / HistoBistro

Weakly-supervised learning pipeline for histopathology images. Publications: Biomarker prediction in colorectal cancer (CRC)
MIT License

Missing code (step1, 2) request #13

Open OMIC-coding opened 11 months ago

OMIC-coding commented 11 months ago

Very seminal work, and the code for step 3 of your pipeline is detailed. However, the code for feature extraction and image preprocessing is missing. For example, there is no description of how the h5 feature file was generated for each cohort. The results cannot be reproduced without this code, even though the random seed is given. Could you please upload it? Looking forward to your reply!

sophiajw commented 10 months ago

Hi liziyu, indeed, I extracted the features with the pipeline of Jakob Kather's lab. You can access it here: https://github.com/KatherLab/marugoto/ Feature extraction with CTransPath is implemented in the branch feature_extraction.
Alternatively, you can use the branch feature_extraction from this repository :) Let me know if you encounter any issues! Sophia

yuvfried commented 10 months ago

@sophiajw and @ValentinKoch, thank you for your great work and this detailed repo. I have one conceptual and one practical question regarding the combination of tile augmentation and the CTransPath feature extractor. In the paper referred to in this repo, you describe these steps:

...To reduce the impact of the staining color on the model generalization, the tiles are stain-color augmented using a structure-preserving GAN trained on TCGA [35]. We extract feature representations of dimension 768 for every tile using the CTransPath model [29]...

  1. Did you notice an improvement in performance when performing stain augmentation before feeding the tile into CTransPath? CTransPath authors aim to design an SSL training system with augmentations that encourage learning features from the relevant content of tiles rather than color attributes, etc. Also, they trained their model on an extremely large and diverse dataset. As such, I would expect it to address stain variations between sites.

  2. I'm aware that the code in the feature_extraction branch might not exactly reflect the preprocessing pipeline described in the paper. Still, I couldn't find the stain augmentation part in the code. You mentioned your previous HistAuGAN work, but I was curious to see how it is implemented here. Could you please point me to the part of the code corresponding to stain augmentation? In the torch transformations for CTransPath in your code, I see only the Resize and Normalize ones: https://github.com/peng-lab/HistoBistro/blob/17fd799f058bcf02d79d031a87bde9006cf615a3/models/model.py#L148-L152

sophiajw commented 10 months ago

hey @yuvfried, thank you!

  1. yes, I included the stain color augmentation because the external results improved with it, especially on the biopsy datasets. this only works if you apply the same stain augmentation/normalization to every tile of one slide. overall, I think this is because CTransPath is only trained with normal HSV color jitter, which is not adapted to histopathology.
  2. yes, you're right. I only implemented it in the other pipeline that I mentioned above https://github.com/KatherLab/marugoto/blob/feature_extraction/marugoto/extract/extract.py#L128
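The "same augmentation for every tile of one slide" point in item 1 can be sketched roughly as follows. This is only an illustration of the idea, not the code from HistoBistro or marugoto: the intensity scaling is a hypothetical stand-in for a real stain augmentation such as HistAuGAN, tiles are represented as flat lists of floats in [0, 1], and the names `sample_stain_scale`, `augment_slide`, and `augment_cohort` are made up for this example.

```python
import random
from typing import Dict, List, Sequence

# Hypothetical stand-in for an image tile: a flat list of intensities in [0, 1].
Tile = List[float]


def sample_stain_scale(rng: random.Random) -> float:
    """Draw one augmentation parameter (a toy replacement for a stain style)."""
    return rng.uniform(0.8, 1.2)


def augment_slide(tiles: Sequence[Tile], rng: random.Random) -> List[Tile]:
    """Apply one shared augmentation to all tiles of a single slide."""
    # The parameter is sampled once per slide, not once per tile,
    # so every tile of the slide is shifted in the same way.
    scale = sample_stain_scale(rng)
    return [[min(1.0, value * scale) for value in tile] for tile in tiles]


def augment_cohort(
    slides: Dict[str, Sequence[Tile]], seed: int
) -> Dict[str, List[Tile]]:
    """Augment every slide; different slides may get different parameters."""
    rng = random.Random(seed)
    return {slide_id: augment_slide(tiles, rng) for slide_id, tiles in slides.items()}
```

Sampling the parameter inside the per-tile loop instead would give each tile of a slide a different appearance, which is the situation item 1 warns against.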

Hope this helps!

OMIC-coding commented 10 months ago

2. https://github.com/KatherLab/marugoto/blob/feature_extraction/marugoto/extract/extract.py#L128

The extract.py file attached to this link appears to be invalid or has been removed. Could you please provide an updated version for it?

OMIC-coding commented 9 months ago

Hi, sophiajw! Could you help me with that issue I mentioned before? Thanks~

sophiajw commented 9 months ago

hey @liziyu-000, thanks for following up! I have now added the feature extraction with HistAuGAN to this repo. You can find it in the branch feature_extraction. Just enable the augmented features by setting the flag --histaugan. You can download the checkpoint of the trained model here. It was trained on patches of the 7 largest submission sites of the TCGA cohorts COAD and READ.

OMIC-coding commented 4 months ago

Hi, sophiajw! I would like to know whether you conducted self-supervised learning on your own dataset before training your transformer model, or whether you just used CTransPath with fixed weights to extract features from individual tiles. Thanks~