kaiko-ai / eva

Evaluation framework for oncology foundation models (FMs)
https://kaiko-ai.github.io/eva/
Apache License 2.0

Add `PANDASmall` dataset #664

Closed nkaenzig closed 1 month ago

nkaenzig commented 1 month ago

Closes #662

-> results in 25x fewer patches and therefore runs roughly 25x faster than the full PANDA benchmark, since patch embedding generation takes up most of the compute time
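As a hedged back-of-the-envelope check (the per-slide patch count of the full benchmark is an assumption here, not stated in this thread): if the full benchmark samples about 1,000 patches per slide, then keeping 20% of slides at 200 patches each reduces the total patch count by a factor of 25.

```python
# Hypothetical figures: 1,000 patches/slide for the full benchmark is an
# assumption used only to illustrate where the 25x factor could come from.
full_slide_frac, full_patches = 1.0, 1000
small_slide_frac, small_patches = 0.2, 200

# Ratio of total patches: full benchmark vs. PANDASmall.
reduction = (full_slide_frac * full_patches) / (small_slide_frac * small_patches)
print(reduction)  # -> 25.0
```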

roman807 commented 1 month ago

thanks @nkaenzig, looks good. How did you determine the data size (20% of slides & 200 patches) -- do we know that for example 10% of data or 100 patches would not be sufficient?

nkaenzig commented 1 month ago

@roman807 Good question.

The number 200 for the # patches was determined experimentally:

[figure: downstream performance vs. number of patches per slide]

You can see in this graphic that there is a significant performance drop when going from 200 to only 100 patches.

Regarding the 20% question: this dataset has 6 classes, and we want to make sure that each of the train, val & test splits still has sufficient examples per class. With the current ratio we have 166 WSIs per class in the train set and 83 per class in each of val/test. Especially for the val/test sets I don't want to go lower in terms of sample count. At 20% the evaluation runtime also becomes reasonable: with ViT-S inference, `eva predict` takes only around 5-10 min, while with ViT-G (giant) it takes around 2 hours.
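The per-class counts above can be sanity-checked with a small sketch. Assuming roughly 1,660 slides per class in the source data (a hypothetical round figure chosen to match the quoted numbers) and a 2:1:1 train/val/test split of the 20% subsample:

```python
def split_counts(slides_per_class, keep_frac=0.2, ratios=(0.5, 0.25, 0.25)):
    """Per-class train/val/test counts after subsampling one class.

    Hypothetical helper for illustration only, not part of eva's API.
    """
    kept = int(slides_per_class * keep_frac)
    train = int(kept * ratios[0])
    val = int(kept * ratios[1])
    test = kept - train - val  # remainder goes to test
    return train, val, test

# ~1,660 slides/class at 20% with a 2:1:1 split reproduces the
# quoted 166 train / 83 val / 83 test WSIs per class.
print(split_counts(1660))  # -> (166, 83, 83)
```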

roman807 commented 1 month ago

Thinking about terminology: should we use "small" instead of "tiny"? I think "tiny" usually refers to something very small, e.g. minimal data for a unit or integration test.