monajalal closed this issue 2 years ago
All datasets are used, each with a 10% sampling weight.
I'm not sure I understand this question. All datasets are used in sampling. From WSIs, we sample patches. For image datasets, we take a random crop per image if the dataset's resolution is consistent with the rest of the training data; otherwise, we resize and then random crop (e.g., if we are sampling 0.5um/pixel datasets and an image dataset is at 0.25um/pixel, we downsample by 2, then crop). The sampled images are 224x224. There are mixed-resolution experiments in the paper; see Section 5.5 and Table 4.
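Roughly, the resize-then-crop logic looks like the sketch below (function and variable names are mine for illustration, not from the released code):

```python
import torchvision.transforms.functional as TF
from torchvision import transforms
from PIL import Image

def sample_patch(img: Image.Image, source_mpp: float, target_mpp: float,
                 patch_size: int = 224) -> Image.Image:
    """Match an image to the target resolution, then take a random crop.

    source_mpp / target_mpp are microns per pixel; e.g., going from
    0.25um/pixel to 0.5um/pixel means downsampling by a factor of 2.
    """
    scale = source_mpp / target_mpp
    if scale != 1.0:
        img = TF.resize(img, [int(img.height * scale), int(img.width * scale)])
    return transforms.RandomCrop(patch_size, pad_if_needed=True)(img)
```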
This is a contentious topic. I am not sure about your specific dataset, but most WSI datasets (including most TCGA cohorts) have hundreds of WSIs or more, so sampling a few patches from only a few of the WSIs should in fact make no difference. My argument is that, in effect, you should be able to use the same source. I would even argue (albeit not persuasively to some reviewers!) that this should be encouraged, as encoding exactly the information you will later use is the power of self-supervision. In fact, this is what most self-supervised works do: they use the ImageNet dataset for both pretraining and fine-tuning. But you should also be aware that some researchers don't share this view and you may get pushback, so approach it with caution.
We may release the training code at some point; we won't release the dataset code. I am no longer a student, so I have other obligations, but I will try to get around to it (mostly cleaning up the code so it is presentable). Consider using any public SimCLR implementation, though; our code was mostly an exercise in distributed computing more than anything, as most of the SimCLR model is already implemented by PyTorch itself (via the torchvision models).
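To give a rough idea, the core model reduces to a torchvision backbone plus a projection head, something like the sketch below (the specific backbone and dimensions are illustrative, not necessarily what we used):

```python
import torch
import torch.nn as nn
from torchvision import models

class SimCLRModel(nn.Module):
    """ResNet backbone + 2-layer MLP projection head, as in the SimCLR paper."""

    def __init__(self, proj_dim: int = 128):
        super().__init__()
        backbone = models.resnet18(weights=None)
        feat_dim = backbone.fc.in_features  # 512 for resnet18
        backbone.fc = nn.Identity()         # drop the classification head
        self.backbone = backbone
        self.projection = nn.Sequential(
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, proj_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Features from self.backbone(x) are what you keep for downstream
        # tasks; the projection output is only used in the contrastive loss.
        return self.projection(self.backbone(x))
```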
Please refer to Table 1. The way we ended up with the 400K (or 40K, etc.) numbers was the limit I put on samples per image (see Section 5.5, first few sentences). Most of my experimental intuition came from the original SimCLR paper; we used most of their optimization settings, and the few tweaks that improved the results are listed in the Appendix. I would say the diversity between images is more relevant than the raw number (see the conclusion), but if you really need a number, anything above 5-10K images is safe. Also, more pretraining is generally better; anything above 500 epochs is a safe bet (unless you hit numerical issues, which manifest as the loss exploding).
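As a toy illustration of that per-image cap (hypothetical names; this is not the dataset code):

```python
import random

def build_patch_list(images: dict, cap_per_image: int = 100) -> list:
    """Collect patch coordinates while capping how many come from one image.

    `images` maps an image id to its list of candidate patch coordinates.
    The cap keeps a single large WSI from dominating the pretraining set,
    favoring diversity between images over raw patch count.
    """
    patch_list = []
    for image_id, coords in images.items():
        k = min(len(coords), cap_per_image)
        for xy in random.sample(coords, k):
            patch_list.append((image_id, xy))
    return patch_list
```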
Hi Ozan,
Once again, thanks a lot for the great paper and the value it adds to the computational pathology community.
I noticed that you used TCGA-UCEC in your training, as shown in Table F.13 of the accepted journal publication.
My questions are: