Closed: dingyiwei closed this issue 3 years ago
@dingyiwei PR #5645 is merged, replacing multiple transpose ops with a single permute in TransformerBlock(). Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐
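For context, the merged change consolidates chained transpose calls into one permute. A minimal illustration of why the two are equivalent (illustrative shapes, not the actual TransformerBlock code):

```python
import torch

x = torch.randn(2, 3, 4, 5)  # e.g. (batch, channels, height, width)

# Two successive transposes...
a = x.transpose(1, 2).transpose(2, 3)  # axis order becomes (0, 2, 3, 1)
# ...are equivalent to a single permute of the same axes.
b = x.permute(0, 2, 3, 1)

assert torch.equal(a, b)
print(a.shape)  # torch.Size([2, 4, 5, 3])
```

A single permute also reads more clearly, since the final axis order is stated in one place.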
Sorry for the delay. @dingyiwei is right! I ignored the arg batch_first=False.
But I have another question about the pos embedding.
self.linear = nn.Linear(c2, c2) # learnable position embedding
It seems to produce different values for different feature maps: if the feature map values change, the pos embedding changes too. This is different from ViT or the original Transformer. Why did you design it this way?
@dingyiwei @glenn-jocher
Good question😂 Indeed ViT uses 1D learnable, randomly initialized parameters as the pos embedding. I knew more about CV than NLP, so the pos embedding felt unfamiliar to me at the time and I applied a common operation in CV instead, something like a residual Linear layer.
Detection is different from classification. It's hard to say whether a residual layer or standalone parameters works better for the pos embedding on Yolo. I'll try to conduct experiments on this issue and post results here.
I think the pos embedding should reflect the distance between feature points, so standalone parameters may be better; Linear(x) doesn't contain much position information.
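To make the contrast concrete, here is a hedged sketch of the two options being discussed: the residual Linear layer (content-dependent, as in TransformerBlock) versus a ViT-style standalone learnable parameter (fixed per position). The dimensions and variable names are illustrative, not taken from the YOLOv5 source:

```python
import torch
import torch.nn as nn

c2, n_tokens = 64, 20 * 20  # hypothetical embedding dim and token count

# Option A: residual Linear "pos embedding" (the TransformerBlock approach).
# Its output depends on the feature values x, so it is content-dependent.
linear_pos = nn.Linear(c2, c2)

# Option B: ViT-style standalone parameters, independent of x.
# The same learned offset is added at each position regardless of content.
param_pos = nn.Parameter(torch.zeros(n_tokens, 1, c2))

x = torch.randn(n_tokens, 1, c2)  # (sequence, batch, embed), batch_first=False
out_a = x + linear_pos(x)  # offset changes whenever x changes
out_b = x + param_pos      # offset is fixed per position after training

print(out_a.shape, out_b.shape)
```

This is exactly the difference raised above: in Option A two different feature maps receive two different "position" signals, while in Option B the positional offset encodes only location.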
@dingyiwei I have a question: why does the transformer block only include an encoder and not a decoder? Is the encoder more suitable for classification tasks?
@sakurasakura1996
My understanding is that the intention of adding the transformer block here is to get better features (by attending to different parts of the image), which might result in better box/class predictions compared to other modules (e.g. C3).
@dingyiwei Hi, I have a question: if the transformer module is added, does that mean the previous pure-CNN pretrained weights can no longer be used?
@Him-wen Yes, you have to train the model from scratch.
Can you provide a pretrained transformer model? Thanks!
@mx2013713828 You may find an outdated model here with this commit. There are no official pretrained models for Yolov5s-transformer.
@dingyiwei Do you have a reference to use this kind of structure?
@zhangweida2080 You may want to take a look at my first few comments in this thread.
@dingyiwei Thank you for your reply. There is no fixed thinking about the usage in different settings. However, since your original idea comes from ViT (https://arxiv.org/pdf/2010.11929.pdf), I suppose you would follow the implementation of ViT. There are some differences:
Thanks a lot.
@zhangweida2080 For the first 2 questions, I had to work out a way to get a better result in a very short time due to personal requirements, so I built a much simpler structure than the Transformer in that paper (but it did work on COCO anyway) and shared it here. With more time and resources I would try more structures and conduct more experiments. For the 3rd question, this is really common when you apply a popular model to a customized dataset, since models tend to be tuned for the popular benchmarks. I cannot help with your specific problem, but I would suggest starting from a model pretrained on a large dataset, collecting as much data as you can, and doing your best on data augmentation. Good luck :)
🚀 Feature
Transformer is popular in NLP and is now also applied in CV. I added C3TR just by replacing the sequential self.m in C3 with a Transformer block, which reduces GFLOPs and lets Yolo achieve a better result.
Motivation
Pitch
I add 3 classes in https://github.com/dingyiwei/yolov5/blob/Transformer/models/common.py :
And I just put it as the last part of the backbone instead of a C3 block. I conducted experiments on 2 Nvidia GTX 1080Ti cards, where depth_multiple and width_multiple are the same as in Yolov5s. Here are my experimental results with img-size 640. For convenience I named the method in this issue Yolov5TRs. We can see that Yolov5TRs get higher mAP@0.5 scores at a faster speed. (I'm not sure why my results for Yolov5s differ from those shown in the README; the model was downloaded from release v4.0.) When depth_multiple and width_multiple are set to larger numbers, C3TR should be more lightweight than C3. Since I don't have much time for this and my machine is not very strong, I did not run experiments on M, L and X. Maybe someone could conduct the future experiments :smile:
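For readers who don't want to dig through the linked common.py, here is a simplified, self-contained sketch of the idea: a C3-like module whose bottleneck stack self.m is replaced by a small transformer encoder over flattened feature-map tokens. This is an approximation for illustration (e.g. it uses plain Conv2d and nn.TransformerEncoder), not a copy of the actual YOLOv5 classes:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    # Simplified: flatten the feature map to tokens, add a residual-Linear
    # position embedding, run self-attention, then restore the map shape.
    def __init__(self, c, num_heads=4, num_layers=1):
        super().__init__()
        self.linear = nn.Linear(c, c)  # learnable position embedding
        layer = nn.TransformerEncoderLayer(c, num_heads, dim_feedforward=4 * c)
        self.tr = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x):
        b, c, h, w = x.shape
        p = x.flatten(2).permute(2, 0, 1)  # (h*w, b, c) tokens, batch_first=False
        p = p + self.linear(p)             # residual pos embedding
        p = self.tr(p)
        return p.permute(1, 2, 0).reshape(b, c, h, w)

class C3TR(nn.Module):
    # Simplified C3 where the bottleneck stack self.m is a TransformerBlock.
    def __init__(self, c1, c2):
        super().__init__()
        c_ = c2 // 2
        self.cv1 = nn.Conv2d(c1, c_, 1)
        self.cv2 = nn.Conv2d(c1, c_, 1)
        self.cv3 = nn.Conv2d(2 * c_, c2, 1)
        self.m = TransformerBlock(c_)

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), 1))

x = torch.randn(1, 64, 20, 20)
print(C3TR(64, 64)(x).shape)  # torch.Size([1, 64, 20, 20])
```

Because the module is shape-preserving, it can slot in wherever a C3 block sits, which is why placing it only at the last (lowest-resolution) backbone stage keeps the attention cost manageable.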