ozanciga / self-supervised-histopathology

Pretrained model for self supervised histopathology
MIT License
110 stars, 12 forks

Which 400,000 images have you used for ckpt? #10

Closed monajalal closed 2 years ago

monajalal commented 2 years ago

Hi Ozan,

I couldn't find in your paper which images make up these 400,000 images. Could you please name the datasets and provide links for downloading them? Thank you so much for the great work. This would also help with reproducibility for folks like me :)

[Two screenshots attached: "Screen Shot 2022-02-18 at 2 15 49 PM" and "Screen Shot 2022-02-18 at 2 16 10 PM"]

ozanciga commented 2 years ago

Thank you @monajalal for your interest. The 400,000 images are sampled evenly from the given datasets. Because some of these datasets require the owner's permission before you can access them, I am not allowed to redistribute the images (copyright issues). However, you can download all or some of these datasets yourself and sample from them. All of them are publicly available, so in the end you can reconstruct/reproduce the set. That said, the main idea of the paper was to do self-supervision with a large number of images, so dataset curation is by no means trivial.

Also, please refer to the paper's text for more details on how the diversity and the number of pretraining images affect the outcome. While the model you are referring to performs the best, there are diminishing returns at some point, so I imagine you should be able to reach ~95% of the final performance with a much smaller dataset (again, refer to the paper for the actual figures, which may vary slightly per use case).
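
For anyone trying to reproduce the sampling step, here is a minimal sketch of what drawing evenly across several datasets could look like. It is not the script used for the paper. The helper names `list_images` and `sample_evenly`, the extension list, and the rule of rolling a small dataset's unused share over to the larger ones are all assumptions made for illustration.

```python
import random
from pathlib import Path

# Assumed set of image extensions; adjust to whatever your datasets contain.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def list_images(root):
    """Hypothetical helper: recursively collect image files under `root`.

    The result is sorted so that sampling with a fixed seed is repeatable.
    """
    return sorted(
        p for p in Path(root).rglob("*") if p.suffix.lower() in IMAGE_EXTENSIONS
    )


def sample_evenly(pools, total, seed=0):
    """Draw up to `total` items, split as evenly as possible across `pools`.

    pools: dict mapping a dataset name to a list of items (e.g. image paths).
    A dataset smaller than its share is used in full, and the shortfall is
    spread over the remaining, larger datasets (an assumed policy).
    """
    rng = random.Random(seed)
    # Never ask for more items than exist in total.
    budget = min(total, sum(len(items) for items in pools.values()))
    # Visit the smallest datasets first so their unused share rolls forward.
    names = sorted(pools, key=lambda name: len(pools[name]))
    picked = {}
    for i, name in enumerate(names):
        pools_left = len(names) - i
        share = budget // pools_left
        k = min(len(pools[name]), share)
        # Sample without replacement within each dataset.
        picked[name] = rng.sample(list(pools[name]), k)
        budget -= k
    return picked


if __name__ == "__main__":
    # Synthetic stand-ins for three datasets of different sizes.
    demo = {
        "small": [f"small_{i}.png" for i in range(10)],
        "medium": [f"medium_{i}.png" for i in range(100)],
        "large": [f"large_{i}.png" for i in range(100)],
    }
    subset = sample_evenly(demo, total=90, seed=42)
    for name, items in subset.items():
        print(name, len(items))  # small 10, medium 40, large 40
```

With real data you would build `pools` from directories you downloaded yourself, for example `{"dataset_a": list_images("path/to/dataset_a")}`, where the paths are placeholders. Fixing the seed and sorting the file lists keeps the resulting subset repeatable across runs.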

I'll leave this open in case you have more questions; otherwise, feel free to close it.