lucidrains / vit-pytorch

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in PyTorch.
MIT License
19.63k stars 2.96k forks

ViT and 3D images #125

Open vlatorre847 opened 3 years ago

vlatorre847 commented 3 years ago

Hi,

I would like to use the ViT to solve a regression problem with three dimensional images.

Did anyone already try to do such a thing?

V. L.

lucidrains commented 3 years ago

@vlatorre847 yes, attention is all you need. just do the patching to tokens, but take 3 dimensional patches instead

x = rearrange(x, 'b c (x p1) (y p2) (z p3) -> b (x y z) (c p1 p2 p3)', p1 = p, p2 = p, p3 = p)
cmartin-isla commented 3 years ago

Hello, thanks for this nice repo. I have a question. I normally work with grayscale images and I'm not quite familiar with einops. Does that pattern imply that channels are simply concatenated? Here, for example, would each token be a concatenation of the unrolled 3D patches, once per channel?

xiaoyu0318 commented 2 years ago

> @vlatorre847 yes, attention is all you need. just do the patching to tokens, but take 3 dimensional patches instead
>
> x = rearrange(x, 'b c (x p1) (y p2) (z p3) -> b (x y z) (c p1 p2 p3)', p1 = p, p2 = p, p3 = p)

How do I invert this rearrangement to recover the original order?