ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Apply Transformer in the backbone #2329

Closed dingyiwei closed 3 years ago

dingyiwei commented 3 years ago

🚀 Feature

Transformers are popular in NLP and are now also being applied to CV. I added C3TR simply by replacing the sequential self.m in C3 with a Transformer block, which could reduce GFLOPs and help Yolo achieve better results.

Pitch

I added 3 classes in https://github.com/dingyiwei/yolov5/blob/Transformer/models/common.py:

class TransformerLayer(nn.Module):
    def __init__(self, c, num_heads):
        super().__init__()

        self.ln1 = nn.LayerNorm(c)
        self.q = nn.Linear(c, c, bias=False)
        self.k = nn.Linear(c, c, bias=False)
        self.v = nn.Linear(c, c, bias=False)
        # default batch_first=False: inputs are (seq_len, batch, embed_dim)
        self.ma = nn.MultiheadAttention(embed_dim=c, num_heads=num_heads)
        self.ln2 = nn.LayerNorm(c)
        self.fc1 = nn.Linear(c, c, bias=False)
        self.fc2 = nn.Linear(c, c, bias=False)

    def forward(self, x):
        x_ = self.ln1(x)
        x = self.ma(self.q(x_), self.k(x_), self.v(x_))[0] + x  # self-attention with residual
        x = self.ln2(x)
        x = self.fc2(self.fc1(x)) + x  # feed-forward with residual
        return x

class TransformerBlock(nn.Module):
    def __init__(self, c1, c2, num_heads, num_layers):
        super().__init__()

        self.conv = None
        if c1 != c2:
            self.conv = Conv(c1, c2)
        self.linear = nn.Linear(c2, c2)  # learnable position embedding
        self.tr = nn.Sequential(*[TransformerLayer(c2, num_heads) for _ in range(num_layers)])
        self.c2 = c2

    def forward(self, x):
        if self.conv is not None:
            x = self.conv(x)
        b, _, w, h = x.shape
        p = x.flatten(2)       # (b, c2, w*h)
        p = p.unsqueeze(0)     # (1, b, c2, w*h)
        p = p.transpose(0, 3)  # (w*h, b, c2, 1)
        p = p.squeeze(3)       # (w*h, b, c2): (seq_len, batch, embed_dim)
        e = self.linear(p)     # position embedding
        x = p + e

        x = self.tr(x)
        x = x.unsqueeze(3)     # (w*h, b, c2, 1)
        x = x.transpose(0, 3)  # (1, b, c2, w*h)
        x = x.reshape(b, self.c2, w, h)
        return x

class C3TR(C3):
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):
        super().__init__(c1, c2, n, shortcut, g, e)
        c_ = int(c2 * e)
        self.m = TransformerBlock(c_, c_, 4, n)

And I just put it at the end of the backbone in place of a C3 block:

backbone:
  # [from, number, module, args]
  [[-1, 1, Focus, [64, 3]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   [-1, 9, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32
   [-1, 1, SPP, [1024, [5, 9, 13]]],
   [-1, 3, C3TR, [1024, False]],  # 9    <---- here is my modification
  ]
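
As a quick shape sanity check (a minimal sketch, assuming the 3 classes above have been added to models/common.py; the tensor sizes are only illustrative), C3TR is a drop-in replacement for C3 with the same input/output shape:

import torch
from models.common import C3, C3TR  # assumes the 3 classes above were added to models/common.py

x = torch.randn(2, 512, 20, 20)             # e.g. the P5/32 feature map of a 640x640 input at width_multiple 0.5
c3 = C3(512, 512, n=1, shortcut=False)      # the block being replaced
c3tr = C3TR(512, 512, n=1, shortcut=False)  # drop-in replacement
print(c3(x).shape, c3tr(x).shape)           # both: torch.Size([2, 512, 20, 20])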

I conducted experiments on 2 Nvidia GTX 1080Ti cards, with depth_multiple and width_multiple the same as Yolov5s. Here are my experimental results with img-size 640. For convenience, I refer to the method in this issue as Yolov5TRs.

Model      Params    GFLOPs
Yolov5s    7266973   17.0
Yolov5TRs  7268765   16.8

Model      Dataset          TTA  mAP@.5  mAP@.5:.95  Speed (ms)
Yolov5s    coco (val)       N    0.558   0.365       4.4
Yolov5TRs  coco (val)       N    0.568   0.363       4.4
Yolov5s    coco (test-dev)  N    0.559   0.365       4.6
Yolov5TRs  coco (test-dev)  N    0.567   0.365       4.5
Yolov5s    coco (test-dev)  Y    0.568   0.378       12.0
Yolov5TRs  coco (test-dev)  Y    0.571   0.375       11.0

We can see that Yolov5TRs gets higher mAP@0.5 scores at a faster speed. (I'm not sure why my Yolov5s results differ from those shown in the README; the model was downloaded from release v4.0.) When depth_multiple and width_multiple are set to larger values, C3TR should be more lightweight than C3. Since I do not have much time for this and my machine is not very powerful, I did not run experiments on M, L and X. Maybe someone could conduct further experiments :smile:

glenn-jocher commented 2 years ago

@dingyiwei #5645 PR is merged, replacing multiple transpose ops with a single permute in TransformerBlock(). Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐
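
For readers following along, the permute-based forward is functionally equivalent to the original unsqueeze/transpose/squeeze sequence; roughly (a sketch, not necessarily the exact merged code):

    def forward(self, x):
        if self.conv is not None:
            x = self.conv(x)
        b, _, w, h = x.shape
        p = x.flatten(2).permute(2, 0, 1)  # (b, c2, w*h) -> (w*h, b, c2), i.e. (seq_len, batch, embed_dim)
        return self.tr(p + self.linear(p)).permute(1, 2, 0).reshape(b, self.c2, w, h)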

qiy20 commented 2 years ago

Sorry for the delay. @dingyiwei is right! I ignored the arg batch_first=False.

qiy20 commented 2 years ago

But I have another question about the pos embedding, self.linear = nn.Linear(c2, c2) # learnable position embedding. It gives different values for different feature maps: if the feature map values change, the pos embedding changes too. This is different from ViT and the original Transformer. Why did you design it this way? @dingyiwei @glenn-jocher

dingyiwei commented 2 years ago

Good question😂 Indeed, ViT uses 1D learnable, randomly initialized parameters as the pos embedding. I knew more about CV but little about NLP, so I was unfamiliar with pos embeddings at the time and applied a common operation from CV: something like a residual Linear layer.

Detection is different from classification, and it's hard to say whether a residual layer or standalone parameters work better as the pos embedding for Yolo. I'll try to conduct experiments on this and post results here.
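
For that comparison, a ViT-style standalone pos embedding inside TransformerBlock could look roughly like the sketch below (hypothetical and untested; the name FixedPosEmbedding is just for illustration). Note that it fixes the sequence length w*h, so the feature-map size must stay constant or the embedding must be interpolated:

import torch
import torch.nn as nn

class FixedPosEmbedding(nn.Module):
    # Hypothetical ViT-style alternative: one learnable vector per spatial position,
    # independent of the feature values, added to the flattened sequence.
    def __init__(self, seq_len, c):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(seq_len, 1, c))  # (w*h, 1, c), broadcasts over the batch dim

    def forward(self, p):  # p: (w*h, b, c), as produced in TransformerBlock.forward
        return p + self.pos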

qiy20 commented 2 years ago

I think the pos embedding should reflect the distance between feature points, so standalone parameters may be better; Linear(x) doesn't carry much position information.

sakurasakura1996 commented 2 years ago

@dingyiwei I have a question: why does the transformer block only include an encoder and not a decoder? Is the encoder more suitable for classification tasks?

nrupatunga commented 2 years ago

@sakurasakura1996

My understanding is that the intention of adding the transformer block here is to obtain better features (by attending to different parts of the image), which might result in better box/class predictions compared to other modules (e.g. C3).

iscyy commented 2 years ago

@dingyiwei Hi, I have a question: if the transformer module is added, does that mean the previous pure-CNN pretrained weights can no longer be used?

dingyiwei commented 2 years ago

@Him-wen Yes, you have to train the model from scratch.

mx2013713828 commented 2 years ago

> @Him-wen Yes, you have to train the model from scratch.

Can you provide a pretrained transformer model? Thx!!!

dingyiwei commented 2 years ago

@mx2013713828 You may find an outdated model here, built from this commit. There are no official pretrained models for Yolov5s-transformer.

zhangweida2080 commented 2 years ago

@dingyiwei Do you have a reference for using this kind of structure?

dingyiwei commented 2 years ago

@zhangweida2080 You may want to take a look at my first few comments in this thread.

zhangweida2080 commented 2 years ago

@dingyiwei Thank you for your reply. There is no fixed way of thinking about its usage in different settings. However, since your original idea comes from ViT (https://arxiv.org/pdf/2010.11929.pdf), I suppose you would follow the implementation of ViT. There are some differences:

Thanks a lot.

dingyiwei commented 2 years ago

@zhangweida2080 For the first 2 questions: I had to work out a way to get a better result in a very short time due to my personal requirements, so I built a much simpler structure than the Transformer in that paper (but it really worked on COCO anyway) and shared it here. If I had more time and resources, I would try more structures and conduct more experiments.

For the 3rd question: that is quite common when you apply a popular model to a custom dataset, since the model tends to be tuned for popular datasets. I cannot help with your specific problem, but I would suggest starting from a model pre-trained on a large dataset, collecting as much data as you can, and doing your best with data augmentation. Good luck :)