ifnspaml / SGDepth

[ECCV 2020] Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object Problem by Semantic Guidance
MIT License

Some decoder questions #20

Closed Ale0311 closed 3 years ago

Ale0311 commented 3 years ago

Hello!

I have some questions about the decoders, and I hope you can provide some explanations.

Here are the two classes for the depth and segmentation decoders.

class SGDepthDepth(nn.Module):
    def __init__(self, common, resolutions=1):
        super().__init__()

        self.resolutions = resolutions

        self.decoder = networks.PartialDecoder.gen_tail(common.decoder)
        print('*******',self.decoder)
        self.multires = networks.MultiResDepth(self.decoder.chs_x()[-resolutions:])

    def forward(self, *x):
        x = self.decoder(*x)
        x = self.multires(*x[-self.resolutions:])
        # print('SGDEPTHDEPTH size', x[0].shape)
        return x

class SGDepthSeg(nn.Module):
    def __init__(self, common):
        super().__init__()

        self.decoder = networks.PartialDecoder.gen_tail(common.decoder)
        self.multires = networks.MultiResSegmentation(self.decoder.chs_x()[-1:])
        self.nl = nn.Softmax2d()

    def forward(self, *x):
        x = self.decoder(*x)
        x = self.multires(*x[-1:])
        x_lin = x[-1]

        return x_lin

Here is the MultiRes class:

class MultiRes(nn.Module):
    """ Directly generate target-space outputs from (intermediate) decoder layer outputs
    Args:
        dec_chs: A list of decoder output channel counts
        out_chs: output channels to generate
        pp: A function to call on any output tensor
            for post-processing (e.g. a non-linear activation)
    """

    def __init__(self, dec_chs, out_chs, pp=None):
        super().__init__()

        self.pad = nn.ReflectionPad2d(1)

        self.convs = nn.ModuleList(
            nn.Conv2d(in_chs, out_chs, 3)
            for in_chs in dec_chs[::-1]
        )

        self.pp = pp if (pp is not None) else self._identity_pp

    def _identity_pp(self, x):
        return x

    def forward(self, *x):
        out = tuple(
            self.pp(conv(self.pad(inp)))
            for conv, inp in zip(self.convs[::-1], x)
        )

        return out

These are my questions:

  1. Where is self.nl = nn.Softmax2d() used in the segmentation decoder? I checked the MultiRes class, and while for depth the non-linearity is passed as an argument in super().__init__(dec_chs, out_chs, nn.Sigmoid()), I couldn't find where a softmax is applied for the segmentation part, which only calls super().__init__(dec_chs, out_chs).

  2. What does the MultiRes class do exactly? This is my understanding so far; please correct me if I'm wrong: it takes the last element from each of the 6 tuples in a batch, and for segmentation the output has 20 channels because there are 20 segmentation classes, while for depth the output has only one channel because the depth map is a single HxW map.

self.convs = nn.ModuleList(
            nn.Conv2d(in_chs, out_chs, 3)
            for in_chs in dec_chs[::-1]
        )

As far as these lines of code from the MultiRes class are concerned: why is there a for loop? From the line self.multires = networks.MultiResDepth(self.decoder.chs_x()[-resolutions:]), where resolutions is 1, I understand that MultiResDepth receives only one value, '16', which is the number of output channels of the decoder's last layer (see my reading sketched below).
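
For reference, this is how I read the ModuleList construction; the channel counts below are made up just for illustration:

import torch.nn as nn

# Illustrative only: chs_x()[-resolutions:] with resolutions=1 would be a
# one-element list, so the generator expression builds exactly one 3x3 conv.
dec_chs = [16]   # assumed channel counts, not the actual values from the code
out_chs = 1      # e.g. a single depth channel

convs = nn.ModuleList(
    nn.Conv2d(in_chs, out_chs, 3)  # one conv per entry in dec_chs
    for in_chs in dec_chs[::-1]
)
print(len(convs))  # -> 1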

Thank you in advance!

klingner commented 3 years ago

Hey, let me see what I can do:

  1. The non-linearity is actually not used, so the line self.nl = nn.Softmax2d() could be removed without any effect. The cross-entropy loss in PyTorch expects logits, to which it applies a softmax internally. I probably added this line at some point and forgot to delete it (see the sketch after this list).
  2. I think your understanding is correct. Basically, the class takes the decoder outputs at different levels and puts each of them through a final convolution (with the required number of output channels). This allows the loss to be applied at several levels of the decoder outputs. The depth, for example, is trained on four different decoder outputs, while the segmentation is trained only on the final output.
  3. Actually, resolutions=1 is just the default setting of the class. The value is overwritten by the argument parser's --model-depth-resolutions, whose default value is 4. The for loop is therefore needed, as the depth is processed at four different resolutions of the decoder, and each resolution needs its own output convolution (see the sketch below).
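
To illustrate points 1 and 3, here is a minimal sketch; the shapes and channel counts are assumptions for illustration, not the exact values used in the code:

import torch
import torch.nn as nn

# Point 1: the segmentation head outputs raw logits; CrossEntropyLoss
# applies a (log-)softmax internally, so no Softmax2d is needed before it.
logits = torch.randn(2, 20, 64, 64)         # (batch, classes, H, W), assumed shape
target = torch.randint(0, 20, (2, 64, 64))  # ground-truth class indices
loss = nn.CrossEntropyLoss()(logits, target)

# Point 3: with --model-depth-resolutions at its default of 4, dec_chs has
# four entries, so MultiResDepth builds four output convs, one per resolution.
dec_chs = [128, 64, 32, 16]  # assumed channel counts
convs = nn.ModuleList(nn.Conv2d(c, 1, 3) for c in dec_chs[::-1])
print(len(convs))  # -> 4

The depth outputs additionally go through the nn.Sigmoid() passed as pp, while the segmentation logits go directly into the loss.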

I hope this helps!

Ale0311 commented 3 years ago

Thank you very much for your explanations! Everything makes sense now. 😊