lucidrains / vit-pytorch

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch
MIT License

Different transformer implementations with huggingface vit #126

Open askerlee opened 3 years ago

askerlee commented 3 years ago

Hi Phil, thanks for the great repo. I compared your implementation of ViT with huggingface's (https://github.com/huggingface/transformers/blob/master/src/transformers/models/vit/modeling_vit.py) and found some subtle differences. In particular:

1. In the attention module, huggingface's ViT applies dropout after the softmax (i.e., to the attention probability matrix), but yours doesn't (unless there's a linear output projection, in which case the dropout follows that projection instead).
2. In the FFN, yours has a dropout after the first linear transformation, but huggingface's doesn't.
3. The default dropout rate you adopted is 0.1, whereas huggingface's ViT uses 0.15.

I wonder whether you have tried these different dropout placements? Would they produce any noticeable differences? Thank you very much.
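To make the placements concrete, here is a minimal single-head sketch of the two attention variants and the FFN being compared. This is not the actual code of either library; class names and signatures are illustrative only.

```python
import torch.nn as nn


class AttentionProbDropout(nn.Module):
    # Simplified single-head self-attention with dropout applied to the
    # softmax-normalized attention matrix (the placement described above
    # for huggingface's ViT). Names are illustrative.
    def __init__(self, dim, dropout=0.1):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.attn_drop = nn.Dropout(dropout)          # dropout on attention probabilities
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x):
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = self.attn_drop(attn.softmax(dim=-1))   # <-- dropout right after softmax
        return self.to_out(attn @ v)


class AttentionOutDropout(nn.Module):
    # Same attention, but dropout only after the output projection
    # (roughly the placement in this repo at the time of the issue).
    def __init__(self, dim, dropout=0.1):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.to_out = nn.Sequential(nn.Linear(dim, dim), nn.Dropout(dropout))

    def forward(self, x):
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        return self.to_out(attn @ v)


class FeedForwardInnerDropout(nn.Module):
    # FFN with an extra dropout after the first linear + GELU
    # (point 2 above: present here, absent in huggingface's ViT).
    def __init__(self, dim, hidden_dim, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),          # the dropout in question
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
```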

askerlee commented 3 years ago

BTW, the transformer layers used by huggingface's ViT are basically a verbatim copy of the transformer used in their BERT model.

askerlee commented 3 years ago

I also checked rwightman's pytorch-image-models (timm). The vision transformer he implemented has all of these dropouts, and the dropout rate is 0.1, the same as yours.
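For anyone who wants to experiment, the dropout rates in this repo are exposed through the ViT constructor, as in the README usage below; the 0.1 values are just illustrative, not a recommendation.

```python
import torch
from vit_pytorch import ViT

v = ViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,       # dropout inside the transformer (attention / FFN)
    emb_dropout = 0.1    # dropout on the patch + position embeddings
)

img = torch.randn(1, 3, 256, 256)
preds = v(img)  # (1, 1000)
```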

shabie commented 2 years ago

Do these fine details matter so much?

askerlee commented 2 years ago

They probably won't have a big impact, maybe a fraction of a point, but I don't have enough compute resources to find out...