mahmoodlab / UNI

Towards a general-purpose foundation model for computational pathology - Nature Medicine

DinoV1 vs DinoV2 - The reason behind using DinoV2 for training UNI #42

Open · saghiralfasly opened this issue 3 days ago

saghiralfasly commented 3 days ago

Thank you for making UNI publicly available. Since you likely have extensive experience training UNI and other models in a self-supervised paradigm, I would like to ask whether you have trained and compared DINOv1 against DINOv2 with the same ViT architecture on the same dataset. In other words, why was DINOv2 used for training UNI?

Richarizardd commented 2 days ago

Hi @saghiralfasly - Thank you for your comments. I have previously trained DINOv1 models (in this workshop paper, which was later used in HIPT), but not with ViT-Large.

DINOv2 was used for UNI because, at the time (and still today), it is the best-in-class SSL recipe for extracting frozen feature representations, and it provides a carefully refined recipe for developing SSL models on large datasets (100M+ images). Most other SSL methods (SimCLR, DINO, iBOT) were developed around smaller image datasets (IN-1K ≈ 1M images, IN-22K ≈ 14M images). With a limited compute budget, we could not exhaustively explore other SSL algorithms (also across different model and dataset sizes), which would likely have required heavy hyper-parameter exploration to scale to 100M+ images.

In our experiments, we did perform head-to-head comparisons with iBOT (across ViT-B/L and Mass-1K/22K/100K) and found DINOv2 to deliver better results in both model and data scaling. On Mass-1K pretraining, MoCoV3 (with both ViT-L and ResNet-50 backbones) did not perform as well as DINOv2 ViT-L. Since iBOT is already similar to DINOv2 (they belong to the same family of SSL algorithms), we favored experimenting with other algorithms such as MoCoV3 instead.
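For anyone wanting to run a comparison like this themselves: a common protocol for scoring frozen SSL backbones against each other (not necessarily the exact one used above) is linear probing, i.e., fitting a simple linear classifier on frozen embeddings from each backbone and comparing downstream accuracy. A minimal sketch with scikit-learn, assuming pre-extracted feature arrays; the dummy data and `linear_probe` helper are hypothetical stand-ins:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

def linear_probe(train_feats, train_labels, test_feats, test_labels):
    """Fit a logistic-regression probe on frozen features and score it."""
    clf = LogisticRegression(max_iter=1000, C=1.0)
    clf.fit(train_feats, train_labels)
    preds = clf.predict(test_feats)
    return balanced_accuracy_score(test_labels, preds)

# Dummy arrays stand in for real embeddings extracted from two frozen
# backbones (e.g., DINOv2 ViT-L vs. iBOT ViT-L) on the same labeled data;
# replace with saved .npy feature files in practice.
rng = np.random.default_rng(0)
train_feats, test_feats = rng.normal(size=(200, 1024)), rng.normal(size=(50, 1024))
train_labels, test_labels = rng.integers(0, 2, 200), rng.integers(0, 2, 50)
print(linear_probe(train_feats, train_labels, test_feats, test_labels))
```

Running the same probe on embeddings from each backbone gives a like-for-like comparison, since the only variable is the quality of the frozen features.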