saghiralfasly opened this issue 1 month ago

Thank you for making UNI publicly available. Since you have extensive experience training UNI and other models in a self-supervised paradigm, I would like to ask whether you have trained and compared DINOv1 against DINOv2 with the same ViT architecture on the same dataset. In other words, why was DINOv2 used for training UNI?
Hi @saghiralfasly - Thank you for your comments. I have previously trained DINOv1 models (in this workshop paper, which was later used in HIPT), but not with ViT-Large.
DINOv2 was used for UNI because, at the time (and still today), it was the best-in-class SSL recipe for extracting frozen feature representations, and it provides a carefully refined recipe for developing SSL models on large datasets (100M+ images). Most other SSL methods (SimCLR, DINO, iBOT) were developed around smaller image datasets (IN-1K ~= 1M images, IN-22K ~= 14M images). With a limited compute budget, we could not exhaustively explore other SSL algorithms (let alone across different model and dataset sizes), which would likely have required heavy hyper-parameter exploration to scale to 100M+ images.
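For anyone landing here who wants to see what "frozen feature representations" means in practice, here is a minimal sketch (illustrative only, not the UNI release or its data) using the public DINOv2 ViT-L/14 entry point from the official `facebookresearch/dinov2` hub; the dummy batch stands in for real histology tiles:

```python
import torch

# Load the publicly released DINOv2 ViT-L/14 backbone from torch.hub.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
model.eval()  # the backbone stays frozen; no fine-tuning

with torch.inference_mode():
    images = torch.randn(4, 3, 224, 224)  # dummy batch; substitute real image tiles
    feats = model(images)                 # CLS embeddings, shape (4, 1024) for ViT-L

print(feats.shape)  # torch.Size([4, 1024])
```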
In our experimentation, we did perform head-to-head comparisons with iBOT (across ViT-B/L and Mass-1K/22K/100K) and found DINOv2 to deliver better results under both model and data scaling. On Mass-1K pretraining, MoCoV3 (with both ViT-L and ResNet-50 backbones) did not perform as well as DINOv2 ViT-L. Since iBOT was already similar to DINOv2 (both are in the same family of SSL algorithms), we favored experimenting with other algorithms such as MoCoV3 instead.
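(For anyone wanting to run similar head-to-head comparisons: frozen backbones are typically compared via a linear probe on their extracted embeddings. Below is a minimal, hypothetical sketch with scikit-learn; the random arrays stand in for real per-image embeddings, and this is not the exact evaluation protocol from the paper.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in practice, extract these frozen embeddings from
# each backbone under comparison (e.g., DINOv2 vs. iBOT) on the same data.
train_feats = rng.standard_normal((500, 1024))
train_labels = rng.integers(0, 2, 500)
test_feats = rng.standard_normal((100, 1024))
test_labels = rng.integers(0, 2, 100)

# Linear probe: a logistic-regression head on frozen features; the backbone
# whose features give higher probe accuracy is the better frozen extractor.
probe = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)
print("linear-probe accuracy:", probe.score(test_feats, test_labels))
```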
Thank you, @Richarizardd! Yes, I read the supplementary material of the paper, where the impact of data/model scaling on final performance is assessed. It's interesting and crucial to see such ablation studies. However, in histopathology model training we still lack comprehensive ablation studies comparing DINOv2 with DINOv1. This is concerning, because I believe that while masked image modeling (MIM, the component that iBOT and DINOv2 add on top of DINOv1's self-distillation objective) might enhance the quality of learned representations on general object-centric datasets, it may not necessarily yield the same improvements for histopathology.
Recently, I came across two papers where DINOv1 was compared with DINOv2 or MAE in the histopathology domain. In https://arxiv.org/pdf/2404.15217, they used ViT-B for DINOv1 and ViT-L for DINOv2, while in https://arxiv.org/pdf/2310.07033, they used ViT-S for DINOv1 and ViT-L for MAE. In both cases the backbone changes along with the SSL method, which confounds the comparison.
Therefore, we still lack a focused ablation study comparing DINOv1 and DINOv2 on exactly the same ViT baseline, unless there are studies I may have missed.