jeonsworld / ViT-pytorch

Pytorch reimplementation of the Vision Transformer (An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale)

Why the position_embeddings are zeros? #6

Closed Erichen911 closed 4 years ago

Erichen911 commented 4 years ago

This is a great repo! Thanks a lot.

My question is:

When embedding the image patches, why are the position_embeddings and the cls_token initialized to zeros?

jeonsworld commented 4 years ago

The position embedding index indicates the position of a patch. The indices run from 0 to n, and index 0 is used because the cls_token occupies the first position. As shown in the figure below, the image is divided into patches, which take positions 1 to n.

[figure: an image split into patches, with the cls_token at position 0 and the patch embeddings at positions 1 to n]
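For context, here is a minimal sketch (not the repo's exact code; the module name and hyperparameter defaults are illustrative) of how patch, cls_token, and position embeddings are typically defined in a PyTorch ViT. The zeros are only the *initial values* of learnable `nn.Parameter`s; they are updated during training.

```python
import torch
import torch.nn as nn


class PatchEmbeddings(nn.Module):
    """Sketch of ViT-style patch + position embeddings (illustrative names/sizes)."""

    def __init__(self, img_size=224, patch_size=16, in_channels=3, hidden_size=768):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2

        # Each patch is projected to a hidden_size-dimensional vector.
        self.patch_proj = nn.Conv2d(in_channels, hidden_size,
                                    kernel_size=patch_size, stride=patch_size)

        # Learnable cls_token (position 0) and position embeddings (positions 0..n),
        # both initialized to zeros and learned end to end.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden_size))
        self.position_embeddings = nn.Parameter(torch.zeros(1, n_patches + 1, hidden_size))

    def forward(self, x):
        B = x.shape[0]
        x = self.patch_proj(x)                  # (B, hidden, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)        # (B, n_patches, hidden)
        cls = self.cls_token.expand(B, -1, -1)  # (B, 1, hidden)
        x = torch.cat([cls, x], dim=1)          # cls_token takes position 0
        return x + self.position_embeddings     # add per-position embedding


# Usage: a dummy batch of two 224x224 RGB images.
emb = PatchEmbeddings()
out = emb(torch.randn(2, 3, 224, 224))
print(out.shape)  # torch.Size([2, 197, 768]) -> 196 patches + 1 cls_token
```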