monajalal closed this issue 2 years ago
All datasets are used, each with a 10% sampling weight.
I'm not sure I understand this question. All datasets are used in sampling. From WSIs, we sample patches. For image datasets, we take a random crop per image if the dataset's resolution is consistent with the rest of the training data; otherwise, we resize and then random crop (e.g., if we are sampling 0.5um/pixel datasets and an image dataset is at 0.25um/pixel, we downsample by 2, then crop). The sampled images are 224x224. There are mixed-resolution experiments in the paper; see Section 5.5 and Table 4.
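Roughly, the resize-then-crop logic looks like the sketch below (function and variable names are mine for illustration, not from the released code):

```python
import torchvision.transforms.functional as TF
from torchvision import transforms
from PIL import Image

def sample_patch(img: Image.Image, source_mpp: float, target_mpp: float,
                 patch_size: int = 224) -> Image.Image:
    """Match an image to the target resolution, then take a random crop.

    source_mpp / target_mpp are microns per pixel; e.g., going from
    0.25um/pixel to 0.5um/pixel means downsampling by a factor of 2.
    """
    scale = source_mpp / target_mpp
    if scale != 1.0:
        img = TF.resize(img, [int(img.height * scale), int(img.width * scale)])
    return transforms.RandomCrop(patch_size, pad_if_needed=True)(img)
```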
This is a contentious topic. I am not sure about your specific dataset, but most WSI datasets (including most TCGA cohorts) have hundreds of WSIs or more, so sampling a few patches from only a few of the WSIs should in fact make no difference. My argument is that, in effect, you should be able to use the same source. I would even argue (albeit not persuasively to some reviewers!) that this should be encouraged, as encoding exactly the information you will later use is the power of self-supervision. In fact, this is what most self-supervised works do: they use the ImageNet dataset for both pretraining and fine-tuning. But you should also be aware that some researchers don't share this view and you may get pushback, so approach it with caution.
We may release the training code at some point; we won't release the dataset code. I am no longer a student, so I have other obligations, but I will try to get around to it (mostly cleaning up the code so it is presentable). Consider using any public SimCLR implementation, though; our code was mostly an exercise in distributed computing more than anything, as most of the SimCLR model is already implemented by PyTorch itself (via the torchvision models).
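To give a rough idea, the core model reduces to a torchvision backbone plus a projection head, something like the sketch below (the specific backbone and dimensions are illustrative, not necessarily what we used):

```python
import torch
import torch.nn as nn
from torchvision import models

class SimCLRModel(nn.Module):
    """ResNet backbone + 2-layer MLP projection head, as in the SimCLR paper."""

    def __init__(self, proj_dim: int = 128):
        super().__init__()
        backbone = models.resnet18(weights=None)
        feat_dim = backbone.fc.in_features  # 512 for resnet18
        backbone.fc = nn.Identity()         # drop the classification head
        self.backbone = backbone
        self.projection = nn.Sequential(
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, proj_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Features from self.backbone(x) are what you keep for downstream
        # tasks; the projection output is only used in the contrastive loss.
        return self.projection(self.backbone(x))
```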
Please refer to Table 1. The way we ended up with the 400K (or 40K, etc.) numbers was the limit I put on samples per image (see Section 5.5, first few sentences). Most of my experimental intuition came from the original SimCLR paper; we used most of their optimization settings, and the few tweaks that improved the results are listed in the Appendix. I would say the diversity between images is more relevant than the raw number (see the conclusion), but if you really need a number, anything above 5-10K images is safe. Also, more pretraining is generally better; anything above 500 epochs is a safe bet (unless you hit numerical issues, which manifest as the loss exploding).
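As a toy illustration of that per-image cap (hypothetical names; this is not the dataset code):

```python
import random

def build_patch_list(images: dict, cap_per_image: int = 100) -> list:
    """Collect patch coordinates while capping how many come from one image.

    `images` maps an image id to its list of candidate patch coordinates.
    The cap keeps a single large WSI from dominating the pretraining set,
    favoring diversity between images over raw patch count.
    """
    patch_list = []
    for image_id, coords in images.items():
        k = min(len(coords), cap_per_image)
        for xy in random.sample(coords, k):
            patch_list.append((image_id, xy))
    return patch_list
```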
Hi Ozan,
Once again, thanks a lot for the great paper and the value it adds to the computational pathology community.
I noticed that you used TCGA-UCEC in your training, as shown in Table F.13 of the accepted journal publication.
My questions are: