microsoft / Swin-Transformer

This is an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows".
https://arxiv.org/abs/2103.14030
MIT License

CLS token #298

Open harrygcoppock opened 1 year ago

harrygcoppock commented 1 year ago

Thank you for the great paper and code repo, super nice idea.

You mention in the paper that you experimented with appending a CLS token and using it to perform classification. I was wondering how you treat this CLS token - does it attend to all patches, or just the patches that fall into its local window (in the Swin self-attention process)? I also cannot find where this is implemented in the code; a pointer would be helpful.

Many thanks, Harry

harrygcoppock commented 1 year ago

I also see checks in place which, to my eyes, would prevent a CLS token from simply being prepended, e.g. in the forward method of SwinTransformerBlock.

https://github.com/microsoft/Swin-Transformer/blob/ad1c947e76791d8623b61d178c715f737748ade8/models/swin_transformer.py#L251
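The check being referenced is a shape assertion: Swin reshapes the flattened token sequence back into an H x W feature map before window partitioning, so the sequence length must equal H * W exactly. A minimal sketch (illustrative function name, shapes only) of why a prepended CLS token trips it:

```python
# Hedged sketch of the shape check in SwinTransformerBlock.forward.
# forward_shape_check is an illustrative stand-in, not the repo's API.

def forward_shape_check(L, H, W):
    # Swin flattens the H x W feature map into L = H * W tokens and
    # must reshape it back before window partitioning, so every token
    # has to correspond to a spatial location.
    assert L == H * W, "input feature has wrong size"

H, W = 7, 7
forward_shape_check(H * W, H, W)          # passes: 49 tokens, 7x7 map

try:
    forward_shape_check(H * W + 1, H, W)  # prepending a CLS token adds 1
except AssertionError:
    print("CLS token breaks the reshape")
```

A CLS token has no spatial position, so it cannot survive the sequence-to-feature-map reshape without special handling.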

abueidvchow commented 8 months ago

What the authors describe is applying a global average pooling layer to the feature map output of the final stage and then using a linear classifier for image classification; they report that this strategy achieves accuracy similar to ViT's [cls]-token approach.
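For reference, the pooling-plus-linear-head strategy described above can be sketched as follows (a minimal, self-contained PyTorch example with illustrative shapes; the repo's `SwinTransformer` similarly uses `nn.AdaptiveAvgPool1d` and an `nn.Linear` head):

```python
import torch
import torch.nn as nn

# Illustrative final-stage shapes: 7x7 = 49 tokens of dim 768,
# classifying into 1000 classes (ImageNet-1K style).
B, L, C, num_classes = 2, 49, 768, 1000
x = torch.randn(B, L, C)                  # (B, L, C) token features

avgpool = nn.AdaptiveAvgPool1d(1)         # global average pool over tokens
head = nn.Linear(C, num_classes)          # linear classifier

pooled = avgpool(x.transpose(1, 2)).flatten(1)  # (B, C)
logits = head(pooled)                           # (B, num_classes)
```

Pooling over the token axis here is equivalent to `x.mean(dim=1)`, so no CLS token is needed to obtain a single image-level feature vector.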

harrygcoppock commented 8 months ago

Thanks for the response; however, what you describe is the standard approach for Swin. The authors explicitly say that they also tried prepending a CLS token.