ifnspaml / SGDepth

[ECCV 2020] Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object Problem by Semantic Guidance
MIT License

Changing the backbone to DLA #11

Closed Ale0311 closed 3 years ago

Ale0311 commented 3 years ago

Hello,

In a new experiment I am trying to change the backbone to DLA (Deep Layer Aggregation). Here is the GitHub link: https://github.com/ucbdrive/dla

I noticed that I can extract 6 endpoints from the DLA network. However, your encoder-decoder system only has 5 blocks. Do you have any idea how I could modify the structure so that it works with 6 blocks? I am asking because I got this error:

File "train.py", line 372, in trainer.train() File "train.py", line 340, in train self._run_epoch() File "train.py", line 258, in _run_epoch loss_depth += self._process_batch_depth(dataset, output, output_masked, batch_idx, domain_name) File "train.py", line 193, in _process_batch_depth losses_depth = self.depth_losses.compute_losses(dataset, output, output_masked) File "/home/diogene/Documents/Alexandra/SGDepth-master/losses/depth.py", line 121, in compute_losses losses = self._reprojection_losses(inputs, outputs, outputs_masked) File "/home/diogene/Documents/Alexandra/SGDepth-master/losses/depth.py", line 58, in _reprojection_losses for frame_id in frame_ids File "/home/diogene/Documents/Alexandra/SGDepth-master/losses/depth.py", line 58, in for frame_id in frame_ids File "/home/diogene/Documents/Alexandra/SGDepth-master/losses/depth.py", line 24, in _combined_reprojection_loss l1 = (pred - target).abs().mean(1, True) RuntimeError: The size of tensor a (1280) must match the size of tensor b (640) at non-singleton dimension 3

And these are the changes I made:

  1. I created a dla_encoder.py and in its forward function I returned the outputs of the 6 endpoints (a consolidated sketch of these encoder changes follows the list):

x = self.encoder.base_layer(x)
endpoints = []
for i in range(6):
    x = getattr(self.encoder, 'level{}'.format(i))(x)
    endpoints.append(x)

return tuple(endpoints)
  2. In this file I changed the self.num_ch_enc property to self.num_ch_enc = (16, 32, 64, 128, 256, 512). The numbers correspond to the number of channels of each endpoint.
  3. I loaded the pretrained model in the init function:

model = dla34('imagenet')
self.encoder = model
  4. In sgdepth.py I changed this property (I wanted to add another block to the decoder as well):

self.shape_dec = (512, 256, 128, 64, 32, 16)

  5. In partial_decoder.py, in the UpSkipBlock class, I changed this if statement in the forward function because now I have 6 blocks (I changed the 5 into a 6):
        if self.pos == 6:
            x_pre = x[:self.pos - 1 ]
            x_new = x[self.pos - 1]
            x_skp = tuple()
            x_pst = x[self.pos:]
        else:
            x_pre = x[:self.pos - 1]
            x_new = x[self.pos - 1]
            x_skp = x[self.pos]
            x_pst = x[self.pos:]
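
For reference, the encoder-side changes (1-3) put together look roughly like this. It is only a rough consolidation: the class name DlaEncoder is illustrative, and dla34 is assumed to be importable from the ucbdrive/dla repository:

    import torch.nn as nn
    from dla import dla34  # assumption: module/function names as in https://github.com/ucbdrive/dla

    class DlaEncoder(nn.Module):  # illustrative name, not necessarily the actual class in dla_encoder.py
        def __init__(self):
            super().__init__()
            # change 3: ImageNet-pretrained DLA-34 backbone
            self.encoder = dla34('imagenet')
            # change 2: channel counts of the six endpoints (level0 ... level5)
            self.num_ch_enc = (16, 32, 64, 128, 256, 512)

        def forward(self, x):
            # change 1: collect the outputs of all six DLA levels as endpoints
            x = self.encoder.base_layer(x)
            endpoints = []
            for i in range(6):
                x = getattr(self.encoder, 'level{}'.format(i))(x)
                endpoints.append(x)
            return tuple(endpoints)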

I checked the sizes of the tensors in the function that raises the error (_combined_reprojection_loss in depth.py):

    def _combined_reprojection_loss(self, pred, target):
        """Computes reprojection losses between a batch of predicted and target images
        """

        # Calculate the per-color difference and the mean over all colors
        print(pred.shape)
        print(target.shape)

        # torch.Size([4, 3, 192, 640])
        # torch.Size([4, 3, 192, 640])

        if pred.shape != target.shape:
            return 0

        l1 = (pred - target).abs().mean(1, True)

        ssim = self.ssim(pred, target).mean(1, True)

        reprojection_loss = 0.85 * ssim + 0.15 * l1

        return reprojection_loss

Apparently both the prediction and the target have to be 192x640. However, when I have 6 blocks in the encoder-decoder structure, some of the predictions have this size:

    # torch.Size([4, 3, 384, 1280])

Could this be caused by the DLA architecture? Do I need to modify other parts of the code? Or is the DLA architecture simply not a good fit here?

Thank you!

PS: Sorry for the long post, I just wanted to be as specific as possible.

klingner commented 3 years ago

Hello,

let's try to approach the error: from the error you get, I have the feeling that somehow you upsample the features one time too many in the decoder. The first question is then whether DLA downsamples the features by a factor of 32 (5 times) or by a factor of 64 (6 times). Returning 6 blocks does not automatically mean that you have a larger downsampling factor. However, when you add two more layers to the decoder as you did, it will definitely introduce one more upsampling. Interestingly, you get an error only in the losses, so the error source might also be in the way you connect the encoder and decoder.
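
One quick way to answer this is to pass a dummy tensor through your encoder and look at the resulting strides. This is just a minimal sketch, assuming your wrapper takes a single image tensor and returns a tuple of feature maps (as the encoders here do):

    import torch

    def endpoint_strides(encoder, size=(192, 640)):
        # Pass a dummy batch through the encoder and report each endpoint's
        # downsampling factor relative to the input height.
        x = torch.zeros(1, 3, *size)
        with torch.no_grad():
            feats = encoder(x)
        return [size[0] // f.shape[-2] for f in feats]

    # e.g. [1, 2, 4, 8, 16, 32] would mean the deepest endpoint is downsampled
    # by 32, even though six feature maps are returned.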

Still, my first guess would be the following error source: if DLA does downsample by a factor of only 32 (even though it returns 6 blocks), then there are two possible solutions: (1) use only 5 blocks from DLA and keep the decoder unchanged (probably simple), or (2) adapt the decoder to not upsample in every block iteration but only in the ones where DLA downsamples (probably not so straightforward).
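
Option (1) could look roughly like the sketch below. This is only my assumption of how your DLA wrapper is structured, based on your first post: dropping level0, which runs at full resolution, leaves five endpoints with the same strides (2, 4, 8, 16, 32) as the ResNet encoder, so num_ch_enc would shrink to (32, 64, 128, 256, 512):

    # Sketch only: forward pass returning 5 endpoints (level1 ... level5),
    # skipping level0, which keeps the full input resolution.
    def forward(self, x):
        x = self.encoder.base_layer(x)
        x = self.encoder.level0(x)  # stride 1, not used as an endpoint
        endpoints = []
        for i in range(1, 6):
            x = getattr(self.encoder, 'level{}'.format(i))(x)
            endpoints.append(x)
        return tuple(endpoints)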

If that is not the issue, I would check/print the shapes of the features during the forward pass. The last decoder feature should have the same height and width as the input image; if it does not, you need to modify the decoder architecture until it does.

I hope that helps in finding the error! If not, please let me know, then I will try to think of other solutions.

Ale0311 commented 3 years ago

Hello,

Thank you for your prompt response.

Forwarding only 5 blocks from the encoder and leaving the decoder structure unchanged was the first thing I tried before writing this issue. It worked, but I did not understand exactly how or why. It is clear now, after your explanation that DLA might downsample the features by a factor of 32, which it indeed does. I also wasn't sure it was the right solution, but now I understand that it is.

I wanted, however, to also change the decoder structure. I introduced this condition in the UpSkipBlock class, in the forward function:


if self.pos != 6:
    x_new = self.up(x_new)

The problem remains, and some predictions still have the wrong shape. After this change, though, the tensor with the wrong shape is the 5th one, whereas without it, the tensor with the wrong shape is the 3rd (I honestly have no idea why).

The screenshot below shows pairs of prediction and target shapes from the function that computes the depth loss:

[Screenshot: pairs of prediction and target shapes]

Here are two more screenshots:

[Screenshot 1] [Screenshot 2]

The shape of the SGDepthDepth tensor is the only difference between the two screenshots. The smaller shape ([4, 1, 24, 80]) is from the run where I managed to start training, with 5 blocks in the encoder forward function and the decoder unchanged.

I kept trying, but I cannot manage to change the decoder architecture so that I obtain this size ([4, 1, 24, 80]) in the SGDepthDepth forward function. Do you have any idea why? And is this actually what I need to do?

Thanks a lot!

klingner commented 3 years ago

Hi again, I have to admit that the shapes you printed look a little weird to me, so I tried to put some print statements into the original code as follows:

In the encoder:

def forward(self, l_0):
        print('Encoder Layer 0 shape: ' + str(l_0.shape))
        l_0 = self.encoder.conv1(l_0)
        l_0 = self.encoder.bn1(l_0)
        l_0 = self.encoder.relu(l_0)
        print('Encoder Layer 1 shape: ' + str(l_0.shape))
        l_1 = self.encoder.maxpool(l_0)
        l_1 = self.encoder.layer1(l_1)
        print('Encoder Layer 2 shape: ' + str(l_1.shape))
        l_2 = self.encoder.layer2(l_1)
        print('Encoder Layer 3 shape: ' + str(l_2.shape))
        l_3 = self.encoder.layer3(l_2)
        print('Encoder Layer 4 shape: ' + str(l_3.shape))
        l_4 = self.encoder.layer4(l_3)
        print('Encoder Layer 5 shape: ' + str(l_4.shape))

        return (l_0, l_1, l_2, l_3, l_4)

And the same in the decoder, in the UpSkipBlock forward function:

# upscale the input:
if self.pos == 1:
    print('Decoder Layer 0 shape: ' + str(x_new.shape))
x_new = self.up(x_new)
print('Decoder Layer ' + str(self.pos) + ' shape: ' + str(x_new.shape))

This way you should be able to correctly observe the features passed through the encoder and the decoder. I get an output that looks like this:

Encoder Layer 0 shape: torch.Size([12, 3, 192, 640])
Encoder Layer 1 shape: torch.Size([12, 64, 96, 320])
Encoder Layer 2 shape: torch.Size([12, 64, 48, 160])
Encoder Layer 3 shape: torch.Size([12, 128, 24, 80])
Encoder Layer 4 shape: torch.Size([12, 256, 12, 40])
Encoder Layer 5 shape: torch.Size([12, 512, 6, 20])
Decoder Layer 0 shape: torch.Size([6, 256, 6, 20])
Decoder Layer 1 shape: torch.Size([6, 256, 12, 40])
Decoder Layer 2 shape: torch.Size([6, 128, 24, 80])
Decoder Layer 3 shape: torch.Size([6, 64, 48, 160])
Decoder Layer 4 shape: torch.Size([6, 32, 96, 320])
Decoder Layer 5 shape: torch.Size([6, 16, 192, 640])
Encoder Layer 0 shape: torch.Size([6, 6, 192, 640])
Encoder Layer 1 shape: torch.Size([6, 64, 96, 320])
Encoder Layer 2 shape: torch.Size([6, 64, 48, 160])
Encoder Layer 3 shape: torch.Size([6, 128, 24, 80])
Encoder Layer 4 shape: torch.Size([6, 256, 12, 40])
Encoder Layer 5 shape: torch.Size([6, 512, 6, 20])
Encoder Layer 0 shape: torch.Size([6, 6, 192, 640])
Encoder Layer 1 shape: torch.Size([6, 64, 96, 320])
Encoder Layer 2 shape: torch.Size([6, 64, 48, 160])
Encoder Layer 3 shape: torch.Size([6, 128, 24, 80])
Encoder Layer 4 shape: torch.Size([6, 256, 12, 40])
Encoder Layer 5 shape: torch.Size([6, 512, 6, 20])
Decoder Layer 0 shape: torch.Size([6, 256, 6, 20])
Decoder Layer 1 shape: torch.Size([6, 256, 12, 40])
Decoder Layer 2 shape: torch.Size([6, 128, 24, 80])
Decoder Layer 3 shape: torch.Size([6, 64, 48, 160])
Decoder Layer 4 shape: torch.Size([6, 32, 96, 320])
Decoder Layer 5 shape: torch.Size([6, 16, 192, 640])

You can already see that the shapes in the encoder and decoder correctly correspond to each other in terms of height and width. Could you maybe include these print statements in your code with the DLA architecture and see if any of the shapes goes above 192 x 640? The next step would then be to find out why this happens; I am not completely sure yet. The changes you described in the initial post seem reasonable at first glance, so we may have to look into the forward-pass shapes to see why things do not work out as expected.

Ale0311 commented 3 years ago

Hello,

Thank you for your response! I tried your approach and, as expected, the dimensions of the tensors are as they should be. That's why I was so surprised to get an error in the function that computes the depth loss:

From the encoder I get:

Base layer shape torch.Size([8, 16, 192, 640])
Level 0 shape torch.Size([8, 16, 192, 640])
Level 1 shape torch.Size([8, 32, 96, 320])
Level 2 shape torch.Size([8, 64, 48, 160])
Level 3 shape torch.Size([8, 128, 24, 80])
Level 4 shape torch.Size([8, 256, 12, 40])
Level 5 shape torch.Size([8, 512, 6, 20])

From the decoder I get (with the modifications I made and presented in the previous comments):

Decoder Layer 0 shape: torch.Size([4, 512, 6, 20])
Decoder Layer 1 shape: torch.Size([4, 512, 12, 40])
Decoder Layer 2 shape: torch.Size([4, 256, 24, 80])
Decoder Layer 3 shape: torch.Size([4, 128, 48, 160])
Decoder Layer 4 shape: torch.Size([4, 64, 96, 320])
Decoder Layer 5 shape: torch.Size([4, 32, 192, 640])
Decoder Layer 6 shape: torch.Size([4, 16, 192, 640])

But the same error occurs because the pred and target tensors have a different size:

[Screenshot: the size-mismatch error]

klingner commented 3 years ago

Hello,

thank you for the additional information. Since the error is not in the network architecture, I have another idea: the predictions are upsampled for the loss calculation. In lines 136-142 of perspective_resample.py, this upscaling is written as:

disps = tuple(
    functional.interpolate(
        outputs["disp", res], scale_factor=2**res,
        mode="bilinear", align_corners=False
    )
    for res in resolutions
)

The predictions are upscaled by 2**res, which does not hold if two predictions come at the same resolution (which, as I understand it, is the case for your architecture). This is maybe not ideally written; it would be better to calculate the target shape and then resize to that shape, and I might change that in the future. For your specific problem, however, I think you just need to adapt the scale factor so that it is not multiplied by 2 for the last prediction.
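
A shape-based variant could look roughly like the following sketch; full_res_hw is just a placeholder here for the height and width of the target images, however you obtain them in your setup:

    from torch.nn import functional

    # Sketch: resize every disparity prediction directly to the target resolution,
    # instead of assuming a fixed factor of 2 between consecutive scales.
    full_res_hw = (192, 640)  # placeholder: in practice derive this from the target image tensor, e.g. target.shape[-2:]
    disps = tuple(
        functional.interpolate(
            outputs["disp", res], size=full_res_hw,
            mode="bilinear", align_corners=False
        )
        for res in resolutions
    )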

Ale0311 commented 3 years ago

Hello,

Thank you very much for your prompt response. I managed to get it working just by making this small change, in the line you indicated:

        disps = tuple(
            functional.interpolate(
                outputs["disp", res], size=(192, 640),
                mode="bilinear", align_corners=False
            )
            for res in resolutions
        )

This way, the interpolation function will always resize the prediction to the size of the target. I will let you know about the results after the training is complete. Thank you again for your invaluable help! 😊

Ale0311 commented 3 years ago

Hello again!

As promised, I want to present the results after training.

These are the results when training with the decoder unmodified and passing forward only 5 blocks in the encoder:

| abs_rel | sq_rel | rmse  | rmse_log | a1    | a2    | a3    |
|---------|--------|-------|----------|-------|-------|-------|
| 0.115   | 0.816  | 4.644 | 0.190    | 0.875 | 0.961 | 0.982 |

These are the results when training with a modified decoder structure (6 blocks) and also 6 blocks in the encoder:

| abs_rel | sq_rel | rmse  | rmse_log | a1    | a2    | a3    |
|---------|--------|-------|----------|-------|-------|-------|
| 0.115   | 0.839  | 4.703 | 0.192    | 0.875 | 0.960 | 0.982 |

The former setup (5 encoder blocks, unchanged decoder) performs slightly better.

Thank you for your help!