naver-ai / pit

About object detection training #6

Closed louis624 closed 2 years ago

louis624 commented 2 years ago

Dear authors, thank you for the great paper and its model architecture.

I have some questions related to object detection in your paper.

In Section 4.2 (Object Detection), it is written as follows:

We validate PiT through object detection on COCO dataset [24] in Deformable-DETR [44]. ... Since the original image resolution is too large for transformer-based backbones, we halve the image resolution for training and test of all backbones.

So, my questions are:

  1. Is PiT for object detection trained with a fixed size of 667 by 400 (half of 1333 and 800)? If so, were the images zero-padded when the resized images were smaller than that size (667 by 400)?
  2. For object detection, the input size is clearly different from that used for image classification. Does the patch size of PiT change, or does the number of patches change?
  3. If the number of patches for detection is kept the same as for image classification, does the patch embedding (conv_embedding) then have a larger kernel size?

Thank you in advance.

bhheo commented 2 years ago

Hi @louis624

Thank you for your interest in our paper. Here are my answers.

1. Is PiT for object detection trained with a fixed size of 667 by 400 (half of 1333 and 800)? If so, were the images zero-padded when the resized images were smaller than that size (667 by 400)?

I'm sorry for the confusion. Let me explain our detection setting in detail. We changed these lines of the official Deformable-DETR code: https://github.com/fundamentalvision/Deformable-DETR/blob/11169a60c33333af00a4849f1808023eba96a931/datasets/coco.py#L132-L152

Original

    scales = [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]

    if image_set == 'train':
        return T.Compose([
            T.RandomHorizontalFlip(),
            T.RandomSelect(
                T.RandomResize(scales, max_size=1333),
                T.Compose([
                    T.RandomResize([400, 500, 600]),
                    T.RandomSizeCrop(384, 600),
                    T.RandomResize(scales, max_size=1333),
                ])
            ),
            normalize,
        ])

    if image_set == 'val':
        return T.Compose([
            T.RandomResize([800], max_size=1333),
            normalize,
        ])

Ours

    scales = [400 - i * 16 for i in range(11)]  # [400, 384, ..., 240], i.e. half of the original scales

    if image_set == 'train':
        return T.Compose([
            T.RandomHorizontalFlip(),
            T.RandomSelect(
                T.RandomResize(scales, max_size=666),
                T.Compose([
                    T.RandomResize([200, 250, 300]),
                    T.RandomSizeCrop(192, 300),
                    T.RandomResize(scales, max_size=666),
                ])
            ),
            normalize,
        ])

    if image_set == 'val':
        return T.Compose([
            T.RandomResize([400], max_size=666),
            normalize,
        ])

So, it is not a fixed-size setting, and we didn't use any extra code for zero padding.
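
To make the "no fixed size, no padding" point concrete, here is a rough sketch (my own illustration, not code from Deformable-DETR) of the size rule that DETR-style RandomResize applies: the shorter side is resized to the sampled scale while the longer side is capped at max_size, so the output size varies per image and no padding is needed. The helper name target_size is hypothetical.

    # Illustrative sketch of the DETR-style resize rule (target_size is a made-up helper,
    # not part of Deformable-DETR): scale the shorter side to `size`, but shrink `size`
    # first if the longer side would exceed `max_size`, keeping the aspect ratio.
    def target_size(h, w, size, max_size=666):
        short, long = min(h, w), max(h, w)
        if long / short * size > max_size:
            size = int(round(max_size * short / long))
        scale = size / short
        return int(round(h * scale)), int(round(w * scale))

    # A 480x640 COCO image with a sampled scale of 400 becomes 400x533 (no padding needed).
    print(target_size(480, 640, 400))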

2. For object detection, the input size is clearly different from that used for image classification. Does the patch size of PiT change, or does the number of patches change?

3. If the number of patches for detection is kept the same as for image classification, does the patch embedding (conv_embedding) then have a larger kernel size?

When the input size changes, PiT uses a different number of patches. We didn't change the kernel size of patch_embedding for object detection.
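
As a quick numeric illustration (my own, assuming PiT-S-style patch settings of kernel 16, stride 8, no padding, matching the width formula in the code below): the patch grid is just the output size of the embedding convolution, so the number of patches grows with the input while the kernel stays fixed.

    # Patch-grid size of the conv patch embedding (kernel k, stride s, padding p).
    # The kernel is unchanged for detection; only the grid (number of patches) grows.
    def patch_grid(h, w, k=16, s=8, p=0):
        return (h + 2 * p - k) // s + 1, (w + 2 * p - k) // s + 1

    print(patch_grid(224, 224))   # (27, 27) at the 224x224 classification resolution
    print(patch_grid(400, 640))   # (49, 79) for a half-resolution COCO-style input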

I think the PiT code used for Deformable-DETR is the clearest answer to this question. For detection we use the stage feature maps instead of the cls_token used in image classification, and we interpolate pos_embed when the network processes a different input size. But we didn't change the kernel_size of patch_embedding.

class PoolingTransformer(nn.Module):
    def __init__(self, image_size, patch_size, stride,
                 num_classes, base_dims, depth, heads, mlp_ratio, in_chans=3,
                 attn_drop_rate=.0, drop_rate=.0, drop_path_rate=.0,
                 replace_stride_with_dilation=None):
        super(PoolingTransformer, self).__init__()

        total_block = sum(depth)
        padding = 0
        block_idx = 0

        if replace_stride_with_dilation is None:
            replace_stride_with_dilation = [False, False]
        self.dilation = 1

        width = math.floor(
            (image_size + 2 * padding - patch_size) / stride + 1)

        self.base_dims = base_dims
        self.heads = heads
        self.num_classes = num_classes

        self.patch_size = patch_size
        self.pos_embed = nn.Parameter(
            torch.randn(1, base_dims[0] * heads[0], width, width),
            requires_grad=True)
        self.patch_embed = conv_embedding(in_chans, base_dims[0] * heads[0],
                                          patch_size, stride, padding)

        self.cls_token = nn.Parameter(
            torch.randn(1, 1, base_dims[0] * heads[0]),
            requires_grad=True)
        self.pos_drop = nn.Dropout(p=drop_rate)

        self.transformers = nn.ModuleList([])
        self.pools = nn.ModuleList([])

        for stage in range(len(depth)):
            drop_path_prob = [drop_path_rate * i / total_block
                              for i in
                              range(block_idx, block_idx + depth[stage])]
            block_idx += depth[stage]

            self.transformers.append(
                Transformer(base_dims[stage], depth[stage], heads[stage],
                            mlp_ratio,
                            drop_rate, attn_drop_rate, drop_path_prob)
            )
            if stage < len(heads) - 1:
                stride = 2
                if replace_stride_with_dilation[stage]:
                    self.dilation *= stride
                    stride = 1
                self.pools.append(
                    conv_head_pooling(base_dims[stage] * heads[stage],
                                      base_dims[stage + 1] * heads[stage + 1],
                                      stride=stride,
                                      dilation=self.dilation)
                )

        self.norm = nn.LayerNorm(base_dims[-1] * heads[-1], eps=1e-6)
        self.embed_dim = base_dims[-1] * heads[-1]  # last-stage embedding dim (used by reset_classifier)

        # Classifier head
        self.head = nn.Linear(base_dims[-1] * heads[-1],
                              num_classes) if num_classes > 0 else nn.Identity()

        trunc_normal_(self.pos_embed, std=.02)
        trunc_normal_(self.cls_token, std=.02)
        self.apply(self._init_weights)

    def _init_weights(self, m):
        if isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.bias, 0)
            nn.init.constant_(m.weight, 1.0)

    @torch.jit.ignore
    def no_weight_decay(self):
        return {'pos_embed', 'cls_token'}

    def get_classifier(self):
        return self.head

    def reset_classifier(self, num_classes, global_pool=''):
        self.num_classes = num_classes
        self.head = nn.Linear(self.embed_dim,
                              num_classes) if num_classes > 0 else nn.Identity()

    def no_grad_head(self):
        self.head.weight.requires_grad_(False)
        self.head.bias.requires_grad_(False)
        self.norm.weight.requires_grad_(False)
        self.norm.bias.requires_grad_(False)

    def change_resolution(self, h, w):
        self.pos_embed = nn.Parameter(
            F.interpolate(self.pos_embed.data, (h, w), mode='bicubic'),
            requires_grad=True
        )

    def forward_features(self, x):
        x = self.patch_embed(x)

        # Interpolate pos_embed when the input resolution differs from the pretrained one
        if x.shape[2:4] == self.pos_embed.shape[2:4]:
            pos_embed = self.pos_embed
        else:
            pos_embed = F.interpolate(self.pos_embed, x.shape[2:4],
                                      mode='bicubic')

        x = self.pos_drop(x + pos_embed)
        cls_tokens = self.cls_token.expand(x.shape[0], -1, -1)

        # Collect the feature map of each stage for the detection head
        # (instead of returning cls_token as in image classification)
        features = []

        for stage in range(len(self.pools)):
            x, cls_tokens = self.transformers[stage](x, cls_tokens)
            features.append(x)
            x, cls_tokens = self.pools[stage](x, cls_tokens)
        x, cls_tokens = self.transformers[-1](x, cls_tokens)

        features.append(x)

        return features, cls_tokens

    def forward(self, x):
        features, cls_tokens = self.forward_features(x)
        return features
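
For completeness, here is a minimal usage sketch of the backbone above (my own illustration, not code from the repo). It assumes the helper modules from the official pit.py (conv_embedding, conv_head_pooling, Transformer) and their dependencies are importable, that conv_head_pooling either accepts the dilation keyword or has it dropped (see the follow-up below), and it uses PiT-S-like hyperparameters purely for illustration.

    import torch

    # Hypothetical instantiation with PiT-S-like settings; num_classes=0 turns the
    # classifier head into an Identity, which is fine because forward() returns only
    # the per-stage feature maps.
    model = PoolingTransformer(image_size=224, patch_size=16, stride=8,
                               num_classes=0, base_dims=[48, 48, 48],
                               depth=[2, 6, 4], heads=[3, 6, 12], mlp_ratio=4)

    x = torch.randn(1, 3, 400, 640)   # rectangular input, unlike the 224x224 pretraining size
    features = model(x)               # pos_embed is interpolated internally to the new grid
    for f in features:
        print(f.shape)                # one feature map per stage; spatial size halves per stage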

I hope my answers resolve your questions about our detection setting. Please let me know if you have any further questions.

Best

louis624 commented 2 years ago

Thank you for the detailed answers to my questions!!!

Just one more question about the architecture.

In the architecture that you have shared, there is a dilation argument for conv_head_pooling which does not exist in the conv_head_pooling class.

    self.pools.append(
        conv_head_pooling(base_dims[stage] * heads[stage],
                          base_dims[stage + 1] * heads[stage + 1],
                          stride=stride, dilation=self.dilation)
    )

In this case, since self.dilation is just 1, which is the default dilation of torch.nn.Conv2d, can I just ignore the dilation argument?

Thank you!

bhheo commented 2 years ago

Yes, you can ignore the dilation option.

Because Deformable-DETR supports a dilation option for the backbone network, I implemented it for PiT as well. But I didn't use it in the experiments, so you can simply ignore it.
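
To make that concrete (a small sketch of my own, not code from the repo): since replace_stride_with_dilation defaults to [False, False], self.dilation stays 1, which is also the default dilation of nn.Conv2d, so passing it explicitly changes nothing.

    import torch
    import torch.nn as nn

    # dilation=1 is the default for nn.Conv2d, so passing it explicitly has no effect.
    # (Illustrative layers; not the actual conv_head_pooling parameters.)
    a = nn.Conv2d(8, 8, kernel_size=3, stride=2, padding=1)
    b = nn.Conv2d(8, 8, kernel_size=3, stride=2, padding=1, dilation=1)
    b.load_state_dict(a.state_dict())     # copy weights so the comparison is fair
    x = torch.randn(1, 8, 16, 16)
    print(torch.allclose(a(x), b(x)))     # True: identical outputs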

louis624 commented 2 years ago

Great! Thank you for the detailed explanations!!