Closed KiyoshiMu closed 1 year ago
Thanks for sharing this excellent work! The method is both amazing and elegant.
I wonder if there is a pretrained ViTWSI-4096 (n = 2, h = 3, d = 192) that aggregates the [CLS]4096 tokens and generates a slide-level representation.
I would be interested in this too
Oops, I closed this issue without commenting. At the moment, there is no ViT trained on the [4096 x 4096] tokens, but it is exciting future work!