Closed: monajalal closed this issue 2 years ago
Thank you @monajalal for your interest. The 400,000 images are sampled evenly from the given datasets. As some of these datasets require the owner's permission before you can access them, I am not allowed to share them (copyright issues). However, you may download all or some of these datasets yourself and sample from them. They are fully public, so in the end you can reconstruct/reproduce the collection; that said, the main idea of the paper was to do self-supervision with a large number of images, so dataset curation is by no means trivial. Please also refer to the text for more details on how the diversity and the number of images used in pretraining affect the outcome. While the model you are referring to performs the best, there are diminishing returns at some point, so I imagine you should be able to achieve ~95% of the final performance with a much smaller dataset (again, refer to the paper for the actual figures, which may vary slightly per use case).
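If it helps, here is a rough sketch of how you could sample a fixed total number of images evenly across the datasets once you have downloaded them locally. The directory names, file extensions, and total count are placeholders, not the exact recipe from the paper:

```python
# Minimal sketch (assumptions: datasets downloaded locally as image folders;
# directory names below are hypothetical placeholders).
import random
from pathlib import Path

DATASET_DIRS = ["./dataset_a", "./dataset_b", "./dataset_c"]  # replace with your downloads
TOTAL_IMAGES = 400_000
EXTENSIONS = {".jpg", ".jpeg", ".png"}

per_dataset = TOTAL_IMAGES // len(DATASET_DIRS)
sampled = []
for d in DATASET_DIRS:
    files = [p for p in Path(d).rglob("*") if p.suffix.lower() in EXTENSIONS]
    # Sample without replacement; take everything if a dataset is smaller than its quota.
    k = min(per_dataset, len(files))
    sampled.extend(random.sample(files, k))

print(f"Collected {len(sampled)} images")
```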
I'll leave this open in case you have more questions; otherwise, feel free to close it.
Hi Ozan,
I couldn't find in your transcript which images make up these 400,000 images. Could you please name them and provide a link to download them? Thank you so much for the great work. This would also help with reproducibility for folks like me :)