Open askerlee opened 3 years ago

Hi Phil, thanks for the great repo. I compared your implementation of ViT with huggingface's (https://github.com/huggingface/transformers/blob/master/src/transformers/models/vit/modeling_vit.py) and found some subtle differences. In particular:

1) In the attention module, huggingface's ViT has a dropout after the softmax (i.e., applied to the attention probability matrix), but yours doesn't (unless there's a linear projection layer).
2) In the FFN, yours has a dropout after the first linear transformation, but huggingface's doesn't.
3) The default dropout rate you adopted is 0.1, whereas huggingface's ViT uses 0.15.

I wonder: have you tried different ways of applying dropout? Would they produce any noticeable differences? Thank you very much.
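For concreteness, here is a minimal PyTorch sketch of the two placements under discussion. The module and argument names are illustrative, not copied from either repo, and the 0.1 rates are just example values:

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    """Self-attention with dropout on the post-softmax attention
    matrix (the huggingface-style placement)."""
    def __init__(self, dim, heads=8, attn_drop=0.1):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.attn_drop = nn.Dropout(attn_drop)  # placement difference 1
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        b, n, d = x.shape
        qkv = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = (t.reshape(b, n, self.heads, -1).transpose(1, 2) for t in qkv)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = self.attn_drop(attn.softmax(dim=-1))  # dropout on attention probs
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)

class FeedForward(nn.Module):
    """FFN with dropout after the first linear (the placement in this repo)."""
    def __init__(self, dim, hidden_dim, drop=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(drop),  # placement difference 2
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)

# quick shape check
x = torch.randn(2, 197, 768)
print(Attention(768)(x).shape, FeedForward(768, 3072)(x).shape)
```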
BTW, the transformer layers used by huggingface's ViT are basically a verbatim copy of the transformer used in their BERT model.
I also checked rwightman's pytorch-image-models. The vision transformer he implemented has all of these dropouts, with a rate of 0.1, the same as yours.
Do these fine details matter so much?
They won't have a big impact, maybe some fraction of a point. I don't have enough computational resources to find out...
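As an aside on the timm point above: both dropouts are exposed as constructor arguments there, so they are easy to toggle. A sketch, assuming a reasonably recent timm version (the 0.1 values are illustrative):

```python
import timm

# Build a ViT with both dropouts enabled: drop_rate feeds the MLP/projection
# dropouts, attn_drop_rate the post-softmax attention dropout.
model = timm.create_model(
    'vit_base_patch16_224',
    pretrained=False,
    drop_rate=0.1,       # MLP / projection dropout
    attn_drop_rate=0.1,  # dropout on the attention probability matrix
)
```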