jeonsworld / ViT-pytorch

Pytorch reimplementation of the Vision Transformer (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale)
MIT License
1.94k stars 370 forks

Patch size and n_patches calculation issue when grid is specified #56

Open alabamagan opened 1 year ago

alabamagan commented 1 year ago

https://github.com/jeonsworld/ViT-pytorch/blob/460a162767de1722a014ed2261463dbbc01196b6/models/modeling.py#L132C1-L136C31

The calculations of n_patches and patch_size here seem incorrect. If 16 x 16 patches are assumed, isn't the grid size already determined? Or am I mistaken about the purpose of the grid config?
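To make the question concrete, here is a minimal sketch of how patch size, grid size, and patch count relate in a plain (non-hybrid) ViT. It is not the repository's code: the helper names `grid_from_patch` and `patch_from_grid` are hypothetical, and it assumes the image is split into non-overlapping patches that tile it exactly.

```python
# Hypothetical helpers, not taken from ViT-pytorch. They assume
# non-overlapping patches that tile the image exactly.

def grid_from_patch(img_size, patch_size):
    """Return (grid, n_patches) for a given image size and patch size."""
    h, w = img_size
    ph, pw = patch_size
    if h % ph != 0 or w % pw != 0:
        raise ValueError("patch size must divide the image size evenly")
    grid = (h // ph, w // pw)
    return grid, grid[0] * grid[1]


def patch_from_grid(img_size, grid):
    """Return (patch_size, n_patches) for a given image size and grid."""
    h, w = img_size
    gh, gw = grid
    if h % gh != 0 or w % gw != 0:
        raise ValueError("grid must divide the image size evenly")
    patch_size = (h // gh, w // gw)
    return patch_size, gh * gw


if __name__ == "__main__":
    # Fixing the patch size determines the grid ...
    print(grid_from_patch((224, 224), (16, 16)))  # ((14, 14), 196)
    # ... and fixing the grid determines the patch size.
    print(patch_from_grid((224, 224), (14, 14)))  # ((16, 16), 196)
```

Under these assumptions, image size plus either the patch size or the grid fixes the other quantity, and n_patches always equals the product of the grid dimensions. A patch count computed any other way can disagree with the number of tokens the patch embedding actually produces.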

cai-wenbo commented 6 months ago

I just read the code and had the same question, then found this unresolved issue. I'm fairly confident that the logic in this piece of code is wrong.

alabamagan commented 6 months ago

Yeah, I ended up rewriting my own version. I ended up using an implementation by another person: https://github.com/junyuchen245/ViT-V-Net_for_3D_Image_Registration_Pytorch