raoyongming / DenseCLIP

[CVPR 2022] DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

Identity Head Question #57

Open jhkwag970 opened 1 month ago

jhkwag970 commented 1 month ago

Hello, thank you for sharing your great work!

When computing the pixel-text score loss, you use cross entropy between the score map S and the ground-truth segmentation label Y.

However, the shapes of S (H/4 × W/4, K) and Y (H, W, 1) cannot be matched directly, because S is computed from the encoder's feature map and has a smaller H and W.

In the code, this is handled through the identity head:

```python
@HEADS.register_module()
class IdentityHead(BaseDecodeHead):
    """Panoptic Feature Pyramid Networks.
    This head is the implementation of `Semantic FPN
    <https://arxiv.org/abs/1901.02446>`_.
    Args:
        feature_strides (tuple[int]): The strides for input feature maps.
            stack_lateral. All strides suppose to be power of 2. The first
            one is of largest resolution.
    """

    def __init__(self, **kwargs):
        super(IdentityHead, self).__init__(
            input_transform=None, **kwargs)
        self.conv_seg = None

    def forward(self, inputs):
        return inputs
```

But since this head just passes its input through, I do not see a clear connection for how the shapes of S and Y are matched.

Did you simply downsample the ground-truth map Y to match S?
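My guess is that the resizing follows the usual mmsegmentation pattern, where `BaseDecodeHead.losses` bilinearly upsamples the logits to the label size before computing cross entropy, rather than downsampling Y. A minimal sketch of that pattern (all shapes and variable names here are hypothetical, just to illustrate the shape matching):

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: batch 2, K = 19 classes, a 64x64 input image,
# and a score map S at 1/4 of the input resolution (16x16).
B, K, H, W = 2, 19, 64, 64
score_map = torch.randn(B, K, H // 4, W // 4)  # S: (B, K, H/4, W/4)
labels = torch.randint(0, K, (B, H, W))        # Y: (B, H, W)

# Upsample the logits to the label size, then apply cross entropy,
# mirroring how mmsegmentation resizes seg_logit to seg_label's shape.
upsampled = F.interpolate(score_map, size=(H, W),
                          mode='bilinear', align_corners=False)
loss = F.cross_entropy(upsampled, labels)
print(upsampled.shape)  # torch.Size([2, 19, 64, 64])
```

If this is right, the `IdentityHead` only needs to forward S unchanged, and the shape matching happens inside the base class's loss computation.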

Thank you!