Open hrbigelow opened 2 years ago
I noticed that in the paper each prediction convolution is formulated with output channels = n_boxes * (n_classes + 4), but in the code each feature level's prediction is split into two separate convolutions, one for localization and one for class scores:
self.loc_conv4_3 = nn.Conv2d(512, n_boxes['conv4_3'] * 4, kernel_size=3, padding=1)
self.loc_conv7 = nn.Conv2d(1024, n_boxes['conv7'] * 4, kernel_size=3, padding=1)
self.loc_conv8_2 = nn.Conv2d(512, n_boxes['conv8_2'] * 4, kernel_size=3, padding=1)
self.loc_conv9_2 = nn.Conv2d(256, n_boxes['conv9_2'] * 4, kernel_size=3, padding=1)
self.loc_conv10_2 = nn.Conv2d(256, n_boxes['conv10_2'] * 4, kernel_size=3, padding=1)
self.loc_conv11_2 = nn.Conv2d(256, n_boxes['conv11_2'] * 4, kernel_size=3, padding=1)

# Class prediction convolutions (predict classes in localization boxes)
self.cl_conv4_3 = nn.Conv2d(512, n_boxes['conv4_3'] * n_classes, kernel_size=3, padding=1)
self.cl_conv7 = nn.Conv2d(1024, n_boxes['conv7'] * n_classes, kernel_size=3, padding=1)
self.cl_conv8_2 = nn.Conv2d(512, n_boxes['conv8_2'] * n_classes, kernel_size=3, padding=1)
self.cl_conv9_2 = nn.Conv2d(256, n_boxes['conv9_2'] * n_classes, kernel_size=3, padding=1)
self.cl_conv10_2 = nn.Conv2d(256, n_boxes['conv10_2'] * n_classes, kernel_size=3, padding=1)
self.cl_conv11_2 = nn.Conv2d(256, n_boxes['conv11_2'] * n_classes, kernel_size=3, padding=1)
...
But I believe that, if it were implemented as in the paper, it would be:
self.conv4_3 = nn.Conv2d(512, n_boxes['conv4_3'] * (4 + n_classes), kernel_size=3, padding=1)
self.conv7 = nn.Conv2d(1024, n_boxes['conv7'] * (4 + n_classes), kernel_size=3, padding=1)
self.conv8_2 = nn.Conv2d(512, n_boxes['conv8_2'] * (4 + n_classes), kernel_size=3, padding=1)
self.conv9_2 = nn.Conv2d(256, n_boxes['conv9_2'] * (4 + n_classes), kernel_size=3, padding=1)
self.conv10_2 = nn.Conv2d(256, n_boxes['conv10_2'] * (4 + n_classes), kernel_size=3, padding=1)
self.conv11_2 = nn.Conv2d(256, n_boxes['conv11_2'] * (4 + n_classes), kernel_size=3, padding=1)
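For context, here is a minimal sketch of how the two formulations relate, assuming PyTorch. The sizes (`n_boxes = 4`, `n_classes = 21`, `in_channels = 8`, a 5x5 feature map) are small hypothetical values chosen for illustration and are not taken from the repository. It copies the weights of a separate localization and class convolution into one fused convolution and checks that the outputs agree. Note that this stacks all localization channels before all class channels, which is only a channel permutation of the per-box layout described in the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes for illustration only (not taken from the repository).
n_boxes, n_classes, in_channels = 4, 21, 8

# Separate heads, in the style of the repository code quoted above.
loc_conv = nn.Conv2d(in_channels, n_boxes * 4, kernel_size=3, padding=1)
cl_conv = nn.Conv2d(in_channels, n_boxes * n_classes, kernel_size=3, padding=1)

# One fused head with n_boxes * (4 + n_classes) output channels.
fused_conv = nn.Conv2d(in_channels, n_boxes * (4 + n_classes), kernel_size=3, padding=1)

# Stack the separate filters along the output-channel axis:
# localization channels first, then class channels.
with torch.no_grad():
    fused_conv.weight.copy_(torch.cat([loc_conv.weight, cl_conv.weight], dim=0))
    fused_conv.bias.copy_(torch.cat([loc_conv.bias, cl_conv.bias], dim=0))

# Random stand-in for a feature map: (batch, channels, height, width).
feats = torch.randn(2, in_channels, 5, 5)

with torch.no_grad():
    separate_out = torch.cat([loc_conv(feats), cl_conv(feats)], dim=1)
    fused_out = fused_conv(feats)

# True if both formulations produce the same values up to float tolerance.
outputs_match = torch.allclose(separate_out, fused_out, atol=1e-4)
print(fused_out.shape, outputs_match)
```

If the check holds, the split looks like a matter of code organization rather than a change to what the layers can represent, since each output channel of a convolution has its own independent filter.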
Did you try it the original way, or was this an intentional choice for some reason?
Thank you!