Hi @louis624
Thank you for your interest in our paper. Here are my answers.
I'm sorry for the confusion; let me explain our detection setting in detail. We changed these lines from the official Deformable-DETR code: https://github.com/fundamentalvision/Deformable-DETR/blob/11169a60c33333af00a4849f1808023eba96a931/datasets/coco.py#L132-L152
Original:

```python
scales = [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]

if image_set == 'train':
    return T.Compose([
        T.RandomHorizontalFlip(),
        T.RandomSelect(
            T.RandomResize(scales, max_size=1333),
            T.Compose([
                T.RandomResize([400, 500, 600]),
                T.RandomSizeCrop(384, 600),
                T.RandomResize(scales, max_size=1333),
            ])
        ),
        normalize,
    ])

if image_set == 'val':
    return T.Compose([
        T.RandomResize([800], max_size=1333),
        normalize,
    ])
```
Ours:

```python
scales = [400 - i * 16 for i in range(11)]

if image_set == 'train':
    return T.Compose([
        T.RandomHorizontalFlip(),
        T.RandomSelect(
            T.RandomResize(scales, max_size=666),
            T.Compose([
                T.RandomResize([200, 250, 300]),
                T.RandomSizeCrop(192, 300),
                T.RandomResize(scales, max_size=666),
            ])
        ),
        normalize,
    ])

if image_set == 'val':
    return T.Compose([
        T.RandomResize([400], max_size=666),
        normalize,
    ])
```
So it is not a fixed-size setting, and we didn't use any extra code for zero padding.
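For concreteness, the new scale list is exactly the original multi-scale recipe at half resolution (240-400 in steps of 16 instead of 480-800 in steps of 32, with max_size 666 ≈ 1333 / 2):

```python
scales = [400 - i * 16 for i in range(11)]
print(scales)
# [400, 384, 368, 352, 336, 320, 304, 288, 272, 256, 240]
print([s * 2 for s in reversed(scales)])
# [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]  # the original list
```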
When the input size changes, PiT simply uses a different number of patches; we didn't change the kernel size of the patch embedding for object detection. I think the PiT code we used with Deformable-DETR is the clearest answer to this question. We use features instead of the cls_token used for image classification, and we interpolate pos_embed whenever the network processes a different input size. But we didn't change the kernel_size of patch_embed:
```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
from timm.models.layers import trunc_normal_

# conv_embedding, conv_head_pooling, and Transformer are the building
# blocks defined in the original PiT code (pit.py).


class PoolingTransformer(nn.Module):
    def __init__(self, image_size, patch_size, stride,
                 num_classes, base_dims, depth, heads, mlp_ratio, in_chans=3,
                 attn_drop_rate=.0, drop_rate=.0, drop_path_rate=.0,
                 replace_stride_with_dilation=None):
        super(PoolingTransformer, self).__init__()

        total_block = sum(depth)
        padding = 0
        block_idx = 0
        if replace_stride_with_dilation is None:
            replace_stride_with_dilation = [False, False]
        self.dilation = 1

        # Spatial width of the patch-embedding output at the build-time image size.
        width = math.floor(
            (image_size + 2 * padding - patch_size) / stride + 1)

        self.base_dims = base_dims
        self.heads = heads
        self.num_classes = num_classes
        self.patch_size = patch_size
        # As in the original PiT code; reset_classifier below relies on it.
        self.embed_dim = base_dims[-1] * heads[-1]

        self.pos_embed = nn.Parameter(
            torch.randn(1, base_dims[0] * heads[0], width, width),
            requires_grad=True)
        self.patch_embed = conv_embedding(in_chans, base_dims[0] * heads[0],
                                          patch_size, stride, padding)

        self.cls_token = nn.Parameter(
            torch.randn(1, 1, base_dims[0] * heads[0]),
            requires_grad=True)
        self.pos_drop = nn.Dropout(p=drop_rate)

        self.transformers = nn.ModuleList([])
        self.pools = nn.ModuleList([])

        for stage in range(len(depth)):
            drop_path_prob = [drop_path_rate * i / total_block
                              for i in
                              range(block_idx, block_idx + depth[stage])]
            block_idx += depth[stage]

            self.transformers.append(
                Transformer(base_dims[stage], depth[stage], heads[stage],
                            mlp_ratio,
                            drop_rate, attn_drop_rate, drop_path_prob)
            )
            if stage < len(heads) - 1:
                stride = 2
                if replace_stride_with_dilation[stage]:
                    self.dilation *= stride
                    stride = 1
                self.pools.append(
                    conv_head_pooling(base_dims[stage] * heads[stage],
                                      base_dims[stage + 1] * heads[stage + 1],
                                      stride=stride,
                                      dilation=self.dilation)
                )

        self.norm = nn.LayerNorm(base_dims[-1] * heads[-1], eps=1e-6)

        # Classifier head
        self.head = nn.Linear(base_dims[-1] * heads[-1],
                              num_classes) if num_classes > 0 else nn.Identity()

        trunc_normal_(self.pos_embed, std=.02)
        trunc_normal_(self.cls_token, std=.02)
        self.apply(self._init_weights)

    def _init_weights(self, m):
        if isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.bias, 0)
            nn.init.constant_(m.weight, 1.0)

    @torch.jit.ignore
    def no_weight_decay(self):
        return {'pos_embed', 'cls_token'}

    def get_classifier(self):
        return self.head

    def reset_classifier(self, num_classes, global_pool=''):
        self.num_classes = num_classes
        self.head = nn.Linear(self.embed_dim,
                              num_classes) if num_classes > 0 else nn.Identity()

    def no_grad_head(self):
        self.head.weight.requires_grad_(False)
        self.head.bias.requires_grad_(False)
        self.norm.weight.requires_grad_(False)
        self.norm.bias.requires_grad_(False)

    def change_resolution(self, h, w):
        # Permanently resize the learned positional embedding.
        self.pos_embed = nn.Parameter(
            F.interpolate(self.pos_embed.data, (h, w), mode='bicubic'),
            requires_grad=True
        )

    def forward_features(self, x):
        x = self.patch_embed(x)

        # Interpolate pos_embed on the fly when the input size differs
        # from the one used at construction time.
        if x.shape[2:4] == self.pos_embed.shape[2:4]:
            pos_embed = self.pos_embed
        else:
            pos_embed = F.interpolate(self.pos_embed, x.shape[2:4],
                                      mode='bicubic')

        x = self.pos_drop(x + pos_embed)
        cls_tokens = self.cls_token.expand(x.shape[0], -1, -1)

        features = []
        for stage in range(len(self.pools)):
            x, cls_tokens = self.transformers[stage](x, cls_tokens)
            features.append(x)
            x, cls_tokens = self.pools[stage](x, cls_tokens)
        x, cls_tokens = self.transformers[-1](x, cls_tokens)
        features.append(x)

        return features, cls_tokens

    def forward(self, x):
        # For detection we return the per-stage feature maps, not cls_tokens.
        features, cls_tokens = self.forward_features(x)
        return features
```
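As a quick standalone illustration of the pos_embed interpolation in forward_features (the shapes below are made up for the example, not our detection setting):

```python
import torch
import torch.nn.functional as F

pos_embed = torch.randn(1, 64, 25, 25)  # learned positional embedding (1, C, H, W)
feat = torch.randn(2, 64, 18, 30)       # patch embedding of a differently sized input

# Resize the positional embedding to the current feature-map size on the fly.
if feat.shape[2:4] != pos_embed.shape[2:4]:
    pos_embed = F.interpolate(pos_embed, feat.shape[2:4], mode='bicubic')

out = feat + pos_embed                  # broadcasts over the batch dimension
print(out.shape)                        # torch.Size([2, 64, 18, 30])
```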
I hope my answers resolve your questions about our detection setting. Please let me know if you have any further questions.

Best
Thank you for the detailed answers to my questions!!!
Just one more question about the architecture.
In the architecture you shared, there is a dilation argument passed to conv_head_pooling that does not exist in the conv_head_pooling class:

```python
self.pools.append(
    conv_head_pooling(base_dims[stage] * heads[stage],
                      base_dims[stage + 1] * heads[stage + 1],
                      stride=stride, dilation=self.dilation)
)
```

In this case, since self.dilation is just 1, which is the default value in torch.nn.Conv2d, can I just ignore the dilation?
Thank you!
Yes, you can ignore the dilation option.

Because Deformable-DETR supports a dilation option for the backbone network, I implemented it for PiT. But I didn't use it in the experiments, so you can simply ignore it.
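As a quick sanity check (a snippet of my own, not from the repo): passing dilation=1 to nn.Conv2d is identical to omitting it, because 1 is the default:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv_a = nn.Conv2d(8, 8, kernel_size=3, padding=1, stride=2)              # default dilation
torch.manual_seed(0)
conv_b = nn.Conv2d(8, 8, kernel_size=3, padding=1, stride=2, dilation=1)  # explicit dilation

x = torch.randn(1, 8, 16, 16)
print(torch.equal(conv_a(x), conv_b(x)))  # True: the two layers are identical
```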
Great! Thank you for the detailed explanations!!
Dear authors,

Thank you for the great paper and its model architecture. I have some questions related to the object detection experiments in your paper. In Section 4.2 (Object Detection), it is written as follows:

So, my questions are:

Thank you in advance.