[Open] harrygcoppock opened this issue 1 year ago
Thank you for the great paper and code repo, super nice idea.

You mention in the paper that you experimented with prepending a [cls] token and using it to perform classification. I was wondering how you treat this [cls] token: does it attend to all patches, or only to the patches that fall within its local window (in the Swin self-attention process)? I also cannot find where this is implemented in the code, so a pointer would be helpful.
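For concreteness, here is a minimal sketch of the ViT-style prepending I have in mind (the sizes and the zero-initialised token are illustrative assumptions, not taken from this repo):

```python
import torch

B, num_patches, dim = 2, 196, 768          # illustrative sizes (14x14 patches)
cls_token = torch.zeros(1, 1, dim)         # a learnable nn.Parameter in practice
patches = torch.randn(B, num_patches, dim) # stand-in patch embeddings

# ViT-style prepending: the sequence becomes (B, 197, 768)
x = torch.cat([cls_token.expand(B, -1, -1), patches], dim=1)

# In ViT's global self-attention the [cls] token attends to every patch;
# under Swin's windowed attention it would only see tokens in its own window.
```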
Many thanks,
Harry
I also see checks in place which, to my eye, would prevent a [cls] token from simply being prepended, e.g. in the forward method of the transformer block.
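For example, a shape check of this kind (a paraphrased sketch in the style of the public Swin code, not the repo's exact source):

```python
import torch

def swin_block_forward(x, input_resolution):
    """Sketch of the start of a Swin block's forward pass."""
    H, W = input_resolution
    B, L, C = x.shape
    # Window partitioning reshapes the token sequence back into a 2D grid,
    # so the sequence length must equal H * W exactly. A prepended [cls]
    # token makes L == H * W + 1 and trips this assertion.
    assert L == H * W, "input feature has wrong size"
    x = x.view(B, H, W, C)  # only valid when L == H * W
    # ... windowed self-attention would follow here ...
    return x.view(B, H * W, C)
```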
What the authors are referring to is applying a global average pooling layer to the feature map output of the final stage and then using a linear classifier for image classification, a strategy that achieves the same accuracy as ViT with a [cls] token.
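Roughly, that head looks like the following (a minimal sketch; the 768-dim features and 1000 classes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class GAPHead(nn.Module):
    """Global-average-pool the final-stage tokens, then classify."""
    def __init__(self, dim=768, num_classes=1000):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        # x: (B, L, C) tokens from the final stage; average over the token
        # dimension instead of reading off a [cls] token.
        x = self.norm(x)
        x = x.mean(dim=1)      # global average pooling
        return self.head(x)    # (B, num_classes) logits

logits = GAPHead()(torch.randn(2, 49, 768))  # e.g. a 7x7 final feature map
```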
Thanks for the response; however, what you describe is the standard approach for Swin. The authors explicitly say in the paper that they also tried prepending a [cls] token.