mahmoodlab / HIPT

Hierarchical Image Pyramid Transformer - CVPR 2022 (Oral)

Creating patches and extracting features for [4096 x 4096] #49

Open nam1410 opened 1 year ago

nam1410 commented 1 year ago

@Richarizardd @faisalml - I appreciate your work. I have been using CLAM for quite some time, but I have run into the following obstacle:

[Preface] - I use an in-house dataset, and CLAM works fine. I recently read your paper and wanted to generate the hierarchical attention maps for my custom dataset. I have the splits and features for [256 x 256] patches, but how do I connect the existing [256 x 256] features to the newly extracted [4096 x 4096] features? I have read the open and closed issues but have not found a clear explanation.

Consider a WSI with ~20000 [256 x 256] patches, for which I already have ResNet-50 features extracted and stored on disk using CLAM's scripts. @Richarizardd has mentioned that I have to change [256 x 256] to [4096 x 4096] when creating patches and extracting features. In doing this, is the hierarchy still preserved? For example, if I extract a [4096 x 4096] patch hp1, how do I correlate it with the existing [256 x 256] patches in my directory? Is it via the [x, y] coordinates? Is my understanding of the pre-processing on the right track? Am I missing something?
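To make my assumption concrete: I imagine a [4096 x 4096] region at top-left (X, Y) would contain exactly the 16 x 16 = 256 existing [256 x 256] patches whose coordinates fall inside it. A minimal sketch of that assumed mapping (the function name and coordinate lists are hypothetical, not HIPT/CLAM code):

```python
# Hypothetical sketch of my assumed coordinate mapping; not HIPT/CLAM code.
def patches_inside_region(region_xy, patch_coords, region_size=4096, patch_size=256):
    """Return the [256 x 256] patch coordinates covered by one 4096 region."""
    X, Y = region_xy
    return [
        (x, y) for (x, y) in patch_coords
        if X <= x < X + region_size and Y <= y < Y + region_size
        # assumes both grids share the same origin, so (x - X) % patch_size == 0
    ]
```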

In addition to this, where do I find the ViT-16 weights pretrained on TCGA (ref)? Is it from https://github.com/mahmoodlab/HIPT/blob/b5f4844f2d8b013d06807375166817eeb939a5aa/1-Hierarchical-Pretraining/From%20ViT-16%20to%20ViT-256.ipynb#L29

Do I use this instead of resnet_custom in the feature extraction?

Or is it from https://github.com/mahmoodlab/HIPT/blob/b5f4844f2d8b013d06807375166817eeb939a5aa/HIPT_4K/hipt_4k.py#L67

Please correct me if I am wrong @Richarizardd @faisalml. Thank you.

Anivader commented 1 year ago

Hi @nam1410,

If you just want the 4k features, you can follow this notebook - https://github.com/mahmoodlab/HIPT/blob/master/HIPT_4K/HIPT_4K%20Inference%20%2B%20Attention%20Visualization.ipynb. Basically, you take 4096 x 4096 image regions as input and extract the corresponding 192-dim embedding from ViT_4k-256.
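In code, the inference step from that notebook looks roughly like this (a minimal sketch; the imports, constructor defaults, and transform helper follow my reading of the HIPT_4K folder and may differ from the current repo state):

```python
# Minimal sketch following the linked notebook; verify against the repo.
import torch
from PIL import Image
from hipt_4k import HIPT_4K
from hipt_model_utils import eval_transforms

model = HIPT_4K()                            # loads pretrained ViT-16 / ViT-256 weights
model.eval()

region = Image.open('path/to/region_4k.png') # a 4096 x 4096 RGB crop
x = eval_transforms()(region).unsqueeze(0)   # [1, 3, 4096, 4096]
with torch.no_grad():
    features = model.forward(x)              # [1, 192] region embedding
```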

This is my understanding of the HIPT 4k-feature extraction process. @Richarizardd, please correct me if I am wrong -

For the 4k model, start with a 3 x 4096 x 4096 (RGB) region. Convert this into a sequence of 256 x 256 patches by reshaping it as 3 x 16 (w_256) x 16 (h_256) x 256 x 256, which can be rearranged as B x 3 x 256 x 256, where B = 1 x 16 (w_256) x 16 (h_256) = 256. So "B" here is the number of patches.

Each of these B = 256 patches is passed into ViT_16-256, which yields a 384-dimensional embedding. So, for the entire 4096 x 4096 region, you end up with an embedding tensor of shape [256, 384].

This can now be rearranged as 1 x 384 x 16 (w_256) x 16 (h_256), which is the input to ViT_256-4096. The output is a [1 x 192] embedding tensor.
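The same two-stage pass in PyTorch tensor operations would look something like this (a sketch of the reshaping only; vit16 and vit256 are placeholders standing in for the pretrained ViT_16-256 and ViT_256-4096 models, not the actual HIPT code):

```python
import torch

# Placeholders for the two pretrained models (correct shapes, random outputs).
vit16 = lambda t: torch.randn(t.shape[0], 384)    # ViT_16-256 stand-in
vit256 = lambda g: torch.randn(g.shape[0], 192)   # ViT_256-4096 stand-in

x = torch.randn(1, 3, 4096, 4096)                 # one RGB 4k region

# 1. Unfold into a batch of 256 x 256 patches: [1, 3, 16, 16, 256, 256],
#    then flatten the 16 x 16 grid into B = 256 patches.
patches = x.unfold(2, 256, 256).unfold(3, 256, 256)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, 3, 256, 256)  # [256, 3, 256, 256]

# 2. Embed each patch with ViT_16-256: [256, 384]
emb256 = vit16(patches)

# 3. Re-grid to [1, 384, 16, 16] and embed with ViT_256-4096: [1, 192]
grid = emb256.reshape(1, 16, 16, 384).permute(0, 3, 1, 2)
emb4k = vit256(grid)
print(emb4k.shape)  # torch.Size([1, 192])
```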

nam1410 commented 1 year ago

" Basically, you will need the 4096 x 4096 image regions as input...." Thank you for your response, @Anivader. My question is more focused on how to get those [4096 x 4096] features. Can you have a look at the question again?

clemsgrs commented 1 year ago

They use the CLAM preprocessing pipeline to extract (4096, 4096) regions. You can take a look at https://github.com/clemsgrs/hs2p, where I re-implemented and tweaked CLAM's preprocessing code.
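For reference, with the stock CLAM pipeline that means running create_patches_fp.py with --patch_size 4096 --step_size 4096, then reading back the region coordinates it saves per slide. A sketch of the read-back step (the output path and 'coords' dataset name follow CLAM's output convention as I understand it, so double-check against your install):

```python
# Sketch: inspect the (x, y) coordinates of the 4096 x 4096 regions that
# CLAM's create_patches_fp.py writes out per slide (path is hypothetical).
import h5py

with h5py.File('RESULTS_DIRECTORY/patches/slide_id.h5', 'r') as f:
    coords = f['coords'][:]        # [N, 2] top-left (x, y) of each region
print(coords[:5])
```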