ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Apply Transformer in the backbone #2329

Closed dingyiwei closed 3 years ago

dingyiwei commented 3 years ago

🚀 Feature

Transformers are popular in NLP and are now also being applied to CV. I added C3TR simply by replacing the sequential self.m in C3 with a Transformer block, which could reduce GFLOPs and help Yolo achieve better results.

Pitch

I added 3 classes in https://github.com/dingyiwei/yolov5/blob/Transformer/models/common.py:

class TransformerLayer(nn.Module):
    def __init__(self, c, num_heads):
        super().__init__()

        self.ln1 = nn.LayerNorm(c)
        self.q = nn.Linear(c, c, bias=False)
        self.k = nn.Linear(c, c, bias=False)
        self.v = nn.Linear(c, c, bias=False)
        # default batch_first=False: inputs are (seq_len, batch, embed_dim)
        self.ma = nn.MultiheadAttention(embed_dim=c, num_heads=num_heads)
        self.ln2 = nn.LayerNorm(c)
        self.fc1 = nn.Linear(c, c, bias=False)
        self.fc2 = nn.Linear(c, c, bias=False)

    def forward(self, x):
        x_ = self.ln1(x)
        x = self.ma(self.q(x_), self.k(x_), self.v(x_))[0] + x  # self-attention with residual
        x = self.ln2(x)
        x = self.fc2(self.fc1(x)) + x  # feed-forward with residual
        return x

class TransformerBlock(nn.Module):
    def __init__(self, c1, c2, num_heads, num_layers):
        super().__init__()

        self.conv = None
        if c1 != c2:
            self.conv = Conv(c1, c2)
        self.linear = nn.Linear(c2, c2)  # learnable position embedding
        self.tr = nn.Sequential(*[TransformerLayer(c2, num_heads) for _ in range(num_layers)])
        self.c2 = c2

    def forward(self, x):
        if self.conv is not None:
            x = self.conv(x)
        b, _, w, h = x.shape
        p = x.flatten(2)       # (b, c2, w*h)
        p = p.unsqueeze(0)     # (1, b, c2, w*h)
        p = p.transpose(0, 3)  # (w*h, b, c2, 1)
        p = p.squeeze(3)       # (w*h, b, c2): (seq_len, batch, embed_dim)
        e = self.linear(p)     # position embedding
        x = p + e

        x = self.tr(x)
        x = x.unsqueeze(3)     # (w*h, b, c2, 1)
        x = x.transpose(0, 3)  # (1, b, c2, w*h)
        x = x.reshape(b, self.c2, w, h)
        return x

class C3TR(C3):
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):
        super().__init__(c1, c2, n, shortcut, g, e)
        c_ = int(c2 * e)
        self.m = TransformerBlock(c_, c_, 4, n)

And I just put it at the end of the backbone in place of a C3 block:

backbone:
  # [from, number, module, args]
  [[-1, 1, Focus, [64, 3]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   [-1, 9, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32
   [-1, 1, SPP, [1024, [5, 9, 13]]],
   [-1, 3, C3TR, [1024, False]],  # 9    <---- here is my modification
  ]
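
As a quick shape sanity check (a minimal sketch, assuming the 3 classes above have been added to models/common.py; the tensor sizes are only illustrative), C3TR is a drop-in replacement for C3 with the same input/output shape:

import torch
from models.common import C3, C3TR  # assumes the 3 classes above were added to models/common.py

x = torch.randn(2, 512, 20, 20)             # e.g. the P5/32 feature map of a 640x640 input at width_multiple 0.5
c3 = C3(512, 512, n=1, shortcut=False)      # the block being replaced
c3tr = C3TR(512, 512, n=1, shortcut=False)  # drop-in replacement
print(c3(x).shape, c3tr(x).shape)           # both: torch.Size([2, 512, 20, 20])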

I conducted experiments on 2 Nvidia GTX 1080Ti cards, with depth_multiple and width_multiple the same as Yolov5s. Here are my experimental results with img-size 640. For convenience, I refer to the method in this issue as Yolov5TRs.

Model      Params    GFLOPs
Yolov5s    7266973   17.0
Yolov5TRs  7268765   16.8

Model      Dataset          TTA  mAP@.5  mAP@.5:.95  Speed (ms)
Yolov5s    coco (val)       N    0.558   0.365       4.4
Yolov5TRs  coco (val)       N    0.568   0.363       4.4
Yolov5s    coco (test-dev)  N    0.559   0.365       4.6
Yolov5TRs  coco (test-dev)  N    0.567   0.365       4.5
Yolov5s    coco (test-dev)  Y    0.568   0.378       12.0
Yolov5TRs  coco (test-dev)  Y    0.571   0.375       11.0

We can see that Yolov5TRs gets higher mAP@0.5 scores at a faster speed. (I'm not sure why my Yolov5s results differ from those shown in the README; the model was downloaded from release v4.0.) When depth_multiple and width_multiple are set to larger values, C3TR should be more lightweight than C3. Since I do not have much time for this and my machine is not very powerful, I did not run experiments on M, L and X. Maybe someone could conduct further experiments :smile:

glenn-jocher commented 2 years ago

@dingyiwei #5645 PR is merged, replacing multiple transpose ops with a single permute in TransformerBlock(). Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐
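
For readers following along, the permute-based forward is functionally equivalent to the original unsqueeze/transpose/squeeze sequence; roughly (a sketch, not necessarily the exact merged code):

    def forward(self, x):
        if self.conv is not None:
            x = self.conv(x)
        b, _, w, h = x.shape
        p = x.flatten(2).permute(2, 0, 1)  # (b, c2, w*h) -> (w*h, b, c2), i.e. (seq_len, batch, embed_dim)
        return self.tr(p + self.linear(p)).permute(1, 2, 0).reshape(b, self.c2, w, h)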

qiy20 commented 2 years ago

Sorry for the delay. @dingyiwei is right! I ignored the arg batch_first=False.

qiy20 commented 2 years ago

But I have another question about the pos embedding, self.linear = nn.Linear(c2, c2) # learnable position embedding. It gives different values for different feature maps: if the feature map values change, the pos embedding changes too. This is different from ViT and the original Transformer. Why did you design it this way? @dingyiwei @glenn-jocher

dingyiwei commented 2 years ago

Good question😂 Indeed, ViT uses 1D learnable, randomly initialized parameters as the pos embedding. I knew more about CV but little about NLP, so I was unfamiliar with pos embeddings at the time and applied a common operation from CV: something like a residual Linear layer.

Detection is different from classification, and it's hard to say whether a residual layer or standalone parameters work better as the pos embedding for Yolo. I'll try to conduct experiments on this and post results here.
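
For that comparison, a ViT-style standalone pos embedding inside TransformerBlock could look roughly like the sketch below (hypothetical and untested; the name FixedPosEmbedding is just for illustration). Note that it fixes the sequence length w*h, so the feature-map size must stay constant or the embedding must be interpolated:

import torch
import torch.nn as nn

class FixedPosEmbedding(nn.Module):
    # Hypothetical ViT-style alternative: one learnable vector per spatial position,
    # independent of the feature values, added to the flattened sequence.
    def __init__(self, seq_len, c):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(seq_len, 1, c))  # (w*h, 1, c), broadcasts over the batch dim

    def forward(self, p):  # p: (w*h, b, c), as produced in TransformerBlock.forward
        return p + self.pos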

qiy20 commented 2 years ago

I think the pos embedding should reflect the distance between feature points, so standalone parameters may be better; Linear(x) doesn't carry much position information.

sakurasakura1996 commented 2 years ago

@dingyiwei I have a question: why does the transformer block only include an encoder and not a decoder? Is the encoder more suitable for classification tasks?

nrupatunga commented 2 years ago

@sakurasakura1996

My understanding is that the intention of adding the transformer block here is to obtain better features (by attending to different parts of the image), which might result in better box/class predictions compared to other modules (e.g. C3).

iscyy commented 2 years ago

@dingyiwei Hi, I have a question: if the transformer module is added, does that mean the previous pure-CNN pretrained weights can no longer be used?

dingyiwei commented 2 years ago

@Him-wen Yes, you have to train the model from scratch.

mx2013713828 commented 2 years ago

> @Him-wen Yes, you have to train the model from scratch.

Can you provide a pretrained transformer model? Thx!!!

dingyiwei commented 2 years ago

@mx2013713828 You may find an outdated model here, built from this commit. There are no official pretrained models for Yolov5s-transformer.

zhangweida2080 commented 2 years ago

@dingyiwei Do you have a reference for using this kind of structure?

dingyiwei commented 2 years ago

@zhangweida2080 You may want to take a look at my first few comments in this thread.

zhangweida2080 commented 2 years ago

@dingyiwei Thank you for your reply. There is no fixed way of thinking about its usage in different settings. However, since your original idea comes from ViT (https://arxiv.org/pdf/2010.11929.pdf), I suppose you would follow the implementation of ViT. There are some differences:

Thanks a lot.

dingyiwei commented 2 years ago

@zhangweida2080 For the first 2 questions: I had to work out a way to get a better result in a very short time due to my personal requirements, so I built a much simpler structure than the Transformer in that paper (but it really worked on COCO anyway) and shared it here. If I had more time and resources, I would try more structures and conduct more experiments.

For the 3rd question: that is quite common when you apply a popular model to a custom dataset, since the model tends to be tuned for popular datasets. I cannot help with your specific problem, but I would suggest starting from a model pre-trained on a large dataset, collecting as much data as you can, and doing your best with data augmentation. Good luck :)