Hi @clemsgrs - thank you for the detailed post; I will do my best to respond.
Why is the patch_size argument passed when instantiating VisionTransformer4K not used? Apologies, it was a typo. I wanted to make the token size an argument when instantiating VisionTransformer4K, but ended up making the arguments for VisionTransformer4K the same as those of the regular VisionTransformer class, as there is no change in ViT sequence-length complexity. Whether it is 256-sized images with 16-sized patching or 4096-sized images with 256-sized patching, the sequence length is always 16*16 = 256. To be more exact, the image size for VisionTransformer4K should technically be 3584 while the patch size is 256, as during pretraining the maximum global crop size is [14 x 14] in a [16 x 16 x 384] 2D grid of pre-extracted feature embeddings of 256-sized patches. However, since VisionTransformer4K doesn't actually take in 3584/4096-sized images but rather the 2D grid of pre-extracted feature embeddings, it was easier to keep VisionTransformer4K the same as VisionTransformer. I will fix some of these arguments so that it is less confusing.
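For concreteness, the arithmetic behind that statement:

```python
# The ViT sequence length is identical at both scales:
seq_len_256 = (256 // 16) ** 2      # 256-sized image with 16-sized patches    -> 256 tokens
seq_len_4096 = (4096 // 256) ** 2   # 4096-sized region with 256-sized patches -> 256 tokens
assert seq_len_256 == seq_len_4096 == 256
```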
Why is the img_size argument passed when instantiating VisionTransformer4K left as default (i.e. img_size = [224]) and not set to [256]? See the comment above. In addition, I would note that, as with the original VisionTransformer from DINO, we can't set img_size = [256] when instantiating VisionTransformer4K: the models are trained with img_size = [224], and thus the sequence length in self.pos_embed is (224/16)**2 + 1 = 197. If you change img_size = [256], you would not be able to load the pretrained weights. Despite some of the typos, everything ended up working in a roundabout way because the ViT complexities are consistent across image resolutions, but apologies for the confusion!
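To make the constraint explicit (a minimal sketch, not the repository code):

```python
import torch

# Why img_size must stay [224] to load the pretrained weights:
img_size, patch_size, embed_dim = 224, 16, 192
num_patches = (img_size // patch_size) ** 2              # 196
pos_embed = torch.zeros(1, num_patches + 1, embed_dim)   # [1, 197, 192], matching the checkpoint
# With img_size = 256 the module would allocate a [1, 257, 192] pos_embed instead,
# so the pretrained entry could no longer be loaded (shape mismatch).
print(pos_embed.shape)
```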
Isn't there a confusion between B (supposed to account for the batch size) and M (the number of [4096, 4096] regions per slide)? Shouldn't the cls_token tensor be of shape [batch_size, 1, 192]? I am not sure what the confusion is and may need clarification. However, I would say that when training the local aggregation (of 256-sized features to learn 4K-sized features) in HIPT, you can treat the number of [4096, 4096] regions essentially like a "minibatch", processing all [M x 256 x 384] features at once. The actual "batch size" (# of WSIs) for weakly-supervised learning is 1.
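A rough shape sketch of what this looks like (the exact reshaping in the repository may differ slightly):

```python
import torch

# Shape sketch: the M regions of one slide act as the "minibatch" for the ViT-4K.
M = 38                                                 # hypothetical number of [4096, 4096] regions
x = torch.randn(M, 256, 384)                           # pre-extracted 256-level features per region
x = x.transpose(1, 2).reshape(M, 384, 16, 16)          # 2D grid of embeddings fed to VisionTransformer4K
cls_token = torch.zeros(1, 1, 192).expand(M, -1, -1)   # hence B == M and cls_token is [M, 1, 192]
print(x.shape, cls_token.shape)
```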
I am sorry that you have not had success using the available region-level pre-extracted feature embeddings. What weakly-supervised scaffold code did you use? In this work, CLAM was used for weakly-supervised learning, which I slightly modified for HIPT. Here are some areas in the repository that may help you understand the loss curves and reproduce the results.
Training Logs For All Experiments: All training logs are made available in the following results directory, which you can freely inspect using TensorBoard.
Self-Supervised KNN Performance: This notebook provides a simple sanity check where you can: 1) load all of the pre-extracted region embeddings, 2) take the mean, 3) plug-and-play into scikit-learn with StratifiedKFold(k=10) and KNeighborsClassifier (a minimal sketch is included after this list). Using randomly-generated splits, the performance is roughly on par with the splits used in the paper.
Plugging Pre-Extracted Region Embeddings into CLAM: With all of the 4096-level features pre-extracted, the problem is essentially reduced to a "bag-of-instances" problem (where instead of 256-level features, we have 4096-level features). Using both the official CLAM repository as well as the modified CLAM scaffold used for this work (training commands detailed in the README), I ran a quick experiment that checks how well these features perform on TCGA-BRCA/NSCLC/RCC subtyping using my 10-fold CV. The experiment was run on a machine with a different version of PyTorch+CUDA than the one used in the paper, so results may not be exact (and I did this somewhat quickly in < 1 hour, so may also have made some mistakes), but you can see here that: 1) Results are on par with the results reported in the paper. Both vanilla MIL (pooling [M x 192] features) and a "global aggregation only" version of HIPT (performing vanilla self-attention on [M x 192] features) were trained with 25% / 100% of the training data. 2) Training logs are found here. 3) Since all features were made available on GitHub, one can simply rerun these experiments following the README. 4) Both the official CLAM repository and my modified CLAM version gave similar results, and I would be happy to provide the former as well.
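As mentioned above, here is a minimal sketch of the KNN sanity check (illustrative only, not the notebook's exact code; the random placeholders stand in for the pre-extracted region embeddings and slide labels):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Placeholders: one [M_i, 192] array of region embeddings per slide, plus binary slide labels.
rng = np.random.default_rng(0)
region_feats = [rng.standard_normal((rng.integers(10, 60), 192)) for _ in range(100)]
labels = rng.integers(0, 2, size=100)

slide_feats = np.stack([f.mean(axis=0) for f in region_feats])     # mean-pool regions -> [100, 192]
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)    # 10-fold stratified CV
aucs = cross_val_score(KNeighborsClassifier(), slide_feats, labels, cv=cv, scoring="roc_auc")
print(aucs.mean(), aucs.std())
```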
What problems are you looking to apply HIPT to? I appreciated reading your detailed post on getting this method to work correctly, and I would be happy to understand and work through any pain points you have in using this method on TCGA (and other downstream tasks).
Hi @Richarizardd, thank you for answering so quickly & with details!
Regarding img_size: when setting img_size = [256], all pre-trained weights are nicely loaded, except the positional embedding (because of the mismatching shape). I just realised that, in that case, my positional embedding will be a random tensor during the whole training process (given it gets initialised as such & then gets frozen). I'll stick to img_size = [224] for now.
Regarding B vs M: that makes sense, there is one cls_token per region.
I'm looking to apply HIPT to various computational pathology problems and compare how other methods perform on the same tasks.
Once again, big thanks for the fast & detailed answer. Will reach back to you when I have something new!
Finally got it working!
The issue was coming from me using img_size = [256] instead of the [224] value when instantiating the VisionTransformer / VisionTransformer4K components of the HIPT model. This caused the pre-trained weights for the positional embedding parameter to be skipped when loading the state dict, because of the mismatching shape. As a result, when generating region-level features, the positional embeddings were left as initialised, i.e. random tensors (normal distribution)! This made my region-level features garbage...
I've re-generated the region-level features with img_size = [224] and now get decent loss profiles & AUC numbers, great!
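To make the failure mode concrete, here is a hypothetical sketch of the kind of checkpoint loading that silently skips a shape-mismatched pos_embed (not the repository's exact loading code):

```python
import torch
import torch.nn as nn

def load_filtered(model: nn.Module, ckpt_path: str = "vit4k_xs_dino.pth") -> list:
    """Hypothetical sketch: load a checkpoint while skipping shape-mismatched entries."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model_sd = model.state_dict()
    kept = {k: v for k, v in ckpt.items() if k in model_sd and v.shape == model_sd[k].shape}
    skipped = [k for k in ckpt if k not in kept]      # with img_size=[256], 'pos_embed' lands here
    model.load_state_dict(kept, strict=False)         # pos_embed keeps its random initialisation
    return skipped
```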
Before I close this issue, I have two small follow-up questions:
First, when training on the [M, 256, 384] features, I get a UserWarning that comes from interpolating the positional embeddings (lines 214 to 218):
https://github.com/mahmoodlab/HIPT/blob/b5f4844f2d8b013d06807375166817eeb939a5aa/HIPT_4K/vision_transformer4k.py#L214-L218
Here's the associated warning: UserWarning: Default upsampling behavior when mode=bicubic is changed to align_corners=False since 0.4.0. Please specify align_corners=True if the old behavior is desired. See the documentation of nn.Upsample for details.
To suppress it, I would add align_corners=False, but I wanted to make sure this is the behaviour you would also go for.
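Concretely, the change I have in mind looks roughly like this (a minimal sketch of the interpolation step, not the exact repository code):

```python
import torch
import torch.nn as nn

# Resizing the pretrained 14x14 grid of positional embeddings to the 16x16 grid needed
# for 256 tokens, with align_corners stated explicitly (silencing the UserWarning):
dim, N = 192, 196                                  # embed dim, pretrained number of patch tokens
patch_pos_embed = torch.randn(1, N, dim)           # pretrained patch positional embeddings
grid = int(N ** 0.5)                               # 14
patch_pos_embed = nn.functional.interpolate(
    patch_pos_embed.reshape(1, grid, grid, dim).permute(0, 3, 1, 2),  # -> [1, 192, 14, 14]
    size=(16, 16),
    mode="bicubic",
    align_corners=False,
)
print(patch_pos_embed.shape)                       # torch.Size([1, 192, 16, 16])
```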
Second, I want to check that the only reason to train on the [M, 256, 384] features (i.e. a HIPT model with a pre-trained self.local_vit component) instead of training the global aggregation transformer on the [M, 192] features is to fine-tune the pre-trained self.local_vit (by allowing gradients to flow through this component). In case the pre-trained self.local_vit component gets frozen, the former should yield the same results as the latter; but given it will have a longer forward pass (the features additionally have to go through self.local_vit), one should favour the latter.
Thank you for your previous answer, it really helped me find where the issue was coming from! Now that it is fixed, I'll try to reproduce the experiments you report in the paper & look at the ones you've recently run and linked in your answer above. Will be interesting!
Yes, align_corners=False is also the behaviour I would go for. On your second question: though training with self.local_vit is more expensive to run, one can do more data augmentation via running self.local_vit with dropout. For larger datasets, finetuning self.local_vit may also be helpful. Lastly, another advantage is that, with features at both the 256- and 4096-level, one can also try exploring other variations such as concatenating: 1) the slide feature from aggregating 256-level features via ABMIL, 2) the slide feature from aggregating 4096-level features via ABMIL, and 3) the slide feature from the last Transformer. I have not tried other strategies, but it seems intuitive for capturing the "different scales of features" across resolutions. It would also be fun to mix-and-match different aggregation functions. Thank you for reporting these issues again. I will reflect these changes sometime this weekend.
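Purely as an illustration of that concatenation idea (nothing from the paper, just a sketch):

```python
import torch

# Illustrative only: concatenating slide-level features obtained at different scales /
# with different aggregators before a final classifier.
feat_256_abmil   = torch.randn(192)   # slide feature from ABMIL over 256-level features
feat_4096_abmil  = torch.randn(192)   # slide feature from ABMIL over 4096-level features
feat_transformer = torch.randn(192)   # slide feature from the last Transformer aggregator
slide_feat = torch.cat([feat_256_abmil, feat_4096_abmil, feat_transformer], dim=-1)  # [576]
logits = torch.nn.Linear(576, 2)(slide_feat)   # e.g. a two-class subtyping head
```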
For the record, I got 0.883 ± 0.06 AUC for the breast subtyping task (ILC vs. IDC) using the same dataset (same 875 slides, same 10 folds).
This is on par with the results reported in the paper (0.874 ± 0.06, see Table 1).
The slight difference comes from me using different region-level pre-extracted features: I slightly adapted the CLAM patching code to generate [4096, 4096] regions per slide, then used the provided pre-trained weights to produce region-level features of shape [M, 192]. For each slide, I therefore have slightly different regions.
Hi @clemsgrs, I have a small question. When training at the slide level, did you set freeze_4k = True?
Hi @bryanwong17, when training on region-level features (i.e. a sequence of embeddings shaped [M, 192]), I did set freeze_4k = True.
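In effect, it amounts to something like this (a rough sketch, assuming the flag freezes the pre-trained ViT-4K component):

```python
import torch.nn as nn

def freeze_local_vit(model: nn.Module) -> None:
    """Rough sketch of what freeze_4k amounts to, assuming it freezes the pre-trained ViT-4K."""
    for p in model.local_vit.parameters():   # assumes the model exposes a local_vit submodule
        p.requires_grad = False
    model.local_vit.eval()                   # only the global aggregation layers keep training
```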
Hi @clemsgrs, is the input for training HIPT_LGP_FC of shape [M, 256, 384]? When we set pretrain_4k != None, does it load vit4k_xs_dino.pth and change the dimension to [M, 192]? And then we set freeze_4k = True?
Hi, I'm trying to replicate the subtyping results you report on TCGA-BRCA as a sanity check before applying HIPT to a different dataset. Hence, I'm using the same slides, same splits & same labels as given in the repo. For now, I've stuck to training & evaluating on fold_0.
I'm having trouble training a model: my training loss barely goes down (see picture below: the loss plateaus after epoch 6, with training AUC around 0.50).
After diving deep into the code, there are a few things I'd love your help to understand:
In HIPT_LGP_FC, you set self.local_vit = vit4k_xs(); based on the following lines, this means self.local_vit is an instance of VisionTransformer4K with patch_size = 16:
https://github.com/mahmoodlab/HIPT/blob/2e0adbe943175bcc13327a4f2e8785b59d6c6249/HIPT_4K/vision_transformer4k.py#L267-L272
Then, looking at the VisionTransformer4K class, the default img_size argument is [224]. Combined with patch_size = 16, this means that num_patches = 196 (line 170), which is used on line 174 to instantiate self.pos_embed:
https://github.com/mahmoodlab/HIPT/blob/2e0adbe943175bcc13327a4f2e8785b59d6c6249/HIPT_4K/vision_transformer4k.py#L161-L174
Hence, if we feed HIPT_LGP_FC a tensor of shape [M, 256, 384], as done in the model walkthrough notebook, at some point during the forward pass the interpolate_pos_encoding method gets called. Given x.shape = [M, 257, 384] and pos_embed.shape = [1, 197, 192], npatch = 256 and N = 196: the condition npatch == N on line 204 is False, so we need to interpolate the positional embedding:
https://github.com/mahmoodlab/HIPT/blob/2e0adbe943175bcc13327a4f2e8785b59d6c6249/HIPT_4K/vision_transformer4k.py#L201-L205
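To double-check the numbers (quick sketch):

```python
# Quick check of the token counts involved:
num_patches_default = (224 // 16) ** 2    # 196 -> self.pos_embed has shape [1, 197, 192]
tokens_per_region   = (4096 // 256) ** 2  # 256 -> npatch when feeding [M, 256, 384] features
print(num_patches_default, tokens_per_region)   # 196 256 -> npatch == N is False, so we interpolate
```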
Why is the patch_size argument passed when instantiating VisionTransformer4K actually not used in VisionTransformer4K.__init__()? Instead, a hard-coded value of 16 is used (line 170, see below):
https://github.com/mahmoodlab/HIPT/blob/2e0adbe943175bcc13327a4f2e8785b59d6c6249/HIPT_4K/vision_transformer4k.py#L170
Why is the img_size argument passed when instantiating VisionTransformer4K left as default (i.e. img_size = [224]) and not set to [256]? I get that during self-supervised pre-training you use crops of size [224, 224], but during subtyping we're using the full [256, 256] patch, so I guess we should use img_size = [256], shouldn't we? Doing so, the previously discussed condition npatch == N would become True (hence we would not need to interpolate the positional embedding anymore).
Given we pass a tensor of shape [M, 256, 384] to HIPT_LGP_FC, which gets reshaped to [M, 384, 16, 16] before being passed to HIPT_LGP_FC.local_vit, the following line gives B = M:
https://github.com/mahmoodlab/HIPT/blob/2e0adbe943175bcc13327a4f2e8785b59d6c6249/HIPT_4K/vision_transformer4k.py#L226
Then, in the following lines, we define cls_token as a tensor of shape [M, 1, 192]. Isn't there a confusion between B (supposed to account for the batch size) and M (the number of [4096, 4096] regions per slide)? Shouldn't the cls_token tensor be of shape [batch_size, 1, 192]?
https://github.com/mahmoodlab/HIPT/blob/2e0adbe943175bcc13327a4f2e8785b59d6c6249/HIPT_4K/vision_transformer4k.py#L232-L233
I've also tried training only the global aggregation layers by directly feeding the region-level pre-extracted features (of shape [M, 192]), without success (training loss not really decreasing either). Could you confirm that this should work just as well as training the intermediate transformer + the global aggregation layers on the [M, 256, 384] features?
Thanks!