lucidrains / vit-pytorch

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch

ViT-Dino for Medical images #166

Open · Mushtaqml opened this issue 2 years ago

Mushtaqml commented 2 years ago

Hi!

First of all, I would like to thank you for such a good and up-to-date repo on Vision Transformers.

I want to know whether I can pretrain the ViT using 3D medical images. Do I need to make any changes to the sample code you shared?

Thanks

lucidrains commented 2 years ago

@Mushtaqml yea, you simply have to change the way you encode the patches to be 3d

change https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit.py#L92 to

```python
# batch, channel, (height, width, depth) -> one token per 3D patch; p1/p2/p3 are the patch sizes
Rearrange('b c (h p1) (w p2) (d p3) -> b (h w d) (p1 p2 p3 c)',
          p1 = patch_height, p2 = patch_width, p3 = patch_depth)
```
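Put together, a minimal sketch of the resulting 3D patch embedding (the concrete sizes below are assumptions; in `vit.py` they come from the constructor arguments):

```python
import torch.nn as nn
from einops.layers.torch import Rearrange

# hypothetical sizes -- in vit.py these come from the ViT constructor
patch_height, patch_width, patch_depth = 8, 8, 8
channels, dim = 1, 1024                      # e.g. single-channel CT volumes
patch_dim = channels * patch_height * patch_width * patch_depth

# 3D analogue of to_patch_embedding in vit.py: cut the volume into
# non-overlapping cubes, flatten each cube, and project it to a token
to_patch_embedding = nn.Sequential(
    Rearrange('b c (h p1) (w p2) (d p3) -> b (h w d) (p1 p2 p3 c)',
              p1 = patch_height, p2 = patch_width, p3 = patch_depth),
    nn.Linear(patch_dim, dim),
)
```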
lucidrains commented 2 years ago

you'll have to calculate the appropriate input dimension for the patch projection, which will be channels * p1 * p2 * p3 (so 3 * p1 * p2 * p3 for an RGB volume), and then also set the appropriate length for the absolute positional embedding
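As a sketch of that calculation (the sizes here are assumptions; the volume must divide evenly into patches):

```python
import torch
import torch.nn as nn

# hypothetical volume and patch sizes
image_height, image_width, image_depth = 64, 64, 64
p1, p2, p3 = 8, 8, 8
channels, dim = 1, 1024

patch_dim   = channels * p1 * p2 * p3    # per-token input dimension
num_patches = (image_height // p1) * (image_width // p2) * (image_depth // p3)

# the absolute positional embedding must cover every patch (+1 for the CLS token)
pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
```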

abdelkareemkobo commented 1 month ago

@lucidrains Hello, thanks for your help. Could you help me feed my CT scan images into DINOv2? The input to the model has the following shape:

```python
torch.Size([16, 1, 10, 18, 18])  # batch, channel, depth, height, width
```

How can I rearrange it?
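For illustration, one way to turn that (batch, channel, depth, height, width) tensor into ViT-style tokens with einops; the patch sizes 2 × 3 × 3 are assumptions chosen to divide (10, 18, 18) evenly:

```python
import torch
from einops.layers.torch import Rearrange

x = torch.randn(16, 1, 10, 18, 18)  # batch, channel, depth, height, width

# hypothetical patch sizes that evenly divide (depth, height, width) = (10, 18, 18)
pd, ph, pw = 2, 3, 3

to_tokens = Rearrange('b c (d pd) (h ph) (w pw) -> b (d h w) (pd ph pw c)',
                      pd = pd, ph = ph, pw = pw)

tokens = to_tokens(x)
print(tokens.shape)  # torch.Size([16, 180, 18]) -> 180 tokens of dim 18
```

Note that the pretrained DINOv2 weights assume 2D RGB patches, so the patch projection and positional embedding would likely need to be replaced and retrained for 3D inputs like this.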