sniklaus / softmax-splatting

an implementation of softmax splatting for differentiable forward warping using PyTorch

some details #12

Closed · laomao0 closed this issue 4 years ago

laomao0 commented 4 years ago

Hi sniklaus, I have two questions about the details of your network.

(screenshot: synthesis network architecture diagram)
1. (a) In this picture, the warped frames and the warped first-level features are concatenated [torch.cat], right? (b) If so, then for the first reference image I_0, the warped image and the warped first-level features of I_0 together have (32 + 3) channels, right? (c) If (b) is right, the input to the GridNet is concatenate{warped_I0, warped_feature_I0, warped_I1, warped_feature_I1}, 70 channels in total, and the initial LateralBlock of the GridNet reduces the 70 channels down to 32. Is my understanding right?
2. The second question is about feature warping. Your network extracts a three-level feature pyramid. For example, if the input image is 1x3x256x256 [N C H W], the first-level feature is [1x32x256x256], the second-level feature is [1x64x128x128], and the third-level feature is [1x96x64x64]. When warping the second- and third-level features, we need the optical flow F. The initial flow is [1x2x256x256]; you downsample the flow F to [1x2x128x128] to warp the second-level features, right? For example: `0.5 * nn.functional.interpolate(flow, scale_factor=0.5, mode='bilinear', align_corners=False)` (also halving the flow values, since the displacements are measured in pixels).
(screenshot: feature pyramid extractor)

Thanks in advance for your reply!

sniklaus commented 4 years ago

Yes, the warped frames and the warped first-level features are concatenated. Yes, the input to the green lateral block on the top left is 70 channels (3 + 3 + 32 + 32). Yes, the output of the block on the top left is 32 (the number in each block denotes the number of output channels).
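
For reference, here is a minimal sketch of that channel arithmetic (the tensor names are hypothetical; only the shapes matter):

```python
import torch

warped_I0      = torch.zeros(1,  3, 256, 256)  # warped input frame I_0
warped_feat_I0 = torch.zeros(1, 32, 256, 256)  # warped first-level features of I_0
warped_I1      = torch.zeros(1,  3, 256, 256)  # warped input frame I_1
warped_feat_I1 = torch.zeros(1, 32, 256, 256)  # warped first-level features of I_1

# 3 + 3 + 32 + 32 = 70 channels go into the first lateral block of the GridNet
grid_input = torch.cat([warped_I0, warped_feat_I0, warped_I1, warped_feat_I1], dim=1)
assert grid_input.shape == (1, 70, 256, 256)
```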

Yes, the optical flow needs to be downsampled. Note that it is common for optical flow estimators based on deep learning to yield a low-res prediction (in the case of PWC-Net, the estimate is at 1/4th of the input resolution and is then upsampled to get the full-resolution prediction). So you may not have to downsample the flow for the coarse levels but upsample the flow for the fine levels.
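
To make the scaling explicit, here is a minimal sketch of resampling a flow field between pyramid levels, assuming a hypothetical `flow` tensor of shape [N, 2, H, W]. The flow values have to be scaled together with the spatial resolution, since the displacements are measured in pixels:

```python
import torch
import torch.nn.functional as F

def rescale_flow(flow, factor):
    # resize the flow field and scale the per-pixel displacements by the same
    # factor, since flow vectors are measured in pixels
    return factor * F.interpolate(input=flow, scale_factor=factor, mode='bilinear', align_corners=False)

flow = torch.zeros(1, 2, 256, 256)       # hypothetical full-resolution flow
flow_half = rescale_flow(flow, 0.5)      # [1, 2, 128, 128], for the second-level features
flow_quarter = rescale_flow(flow, 0.25)  # [1, 2, 64, 64], for the third-level features
```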

laomao0 commented 4 years ago

thanks for your reply!

laomao0 commented 4 years ago

> Yes, the warped frames and the warped first-level features are concatenated. Yes, the input to the green lateral block on the top left is 70 channels (3 + 3 + 32 + 32). Yes, the output of the block on the top left is 32 (the number in each block denotes the number of output channels).
>
> Yes, the optical flow needs to be downsampled. Note that it is common for optical flow estimators based on deep learning to yield a low-res prediction (in the case of PWC-Net, the estimate is at 1/4th of the input resolution and is then upsampled to get the full-resolution prediction). So you may not have to downsample the flow for the coarse levels but upsample the flow for the fine levels.


Thanks for your reply! Using PWC-Net, when I upsample the flow from 1/4 resolution to full resolution (for example, 10x10 -> 40x40 pixels), the multiplication factor is 20, as in the PWC-Net code [line 305 of https://github.com/sniklaus/pytorch-pwc/blob/master/run.py].

If I need to upsample the 1/4 flow to 1/2 resolution (for example, 10x10 -> 20x20), the factor is 20/2 = 10. Is that right?

For example, modifying [line 305 of https://github.com/sniklaus/pytorch-pwc/blob/master/run.py]:

```python
# raw network output, at 1/4 of the input resolution
meta_flow = self.forward_pre(tensorPreprocessedFirst, tensorPreprocessedSecond)

# full-resolution flow: upsample and scale the displacements by 20
tensorFlow_L1 = 20.0 * torch.nn.functional.interpolate(input=meta_flow, size=(intHeight, intWidth), mode='bilinear', align_corners=False)

# 1/2-resolution flow: scale the displacements by 20 / 2
tensorFlow_L2 = 20.0 / 2.0 * torch.nn.functional.interpolate(input=meta_flow, size=(int(intHeight / 2), int(intWidth / 2)), mode='bilinear', align_corners=False)

# 1/4-resolution flow: scale the displacements by 20 / 4
tensorFlow_L3 = 20.0 / 4.0 * torch.nn.functional.interpolate(input=meta_flow, size=(int(intHeight / 4), int(intWidth / 4)), mode='bilinear', align_corners=False)
```

Thanks for your patience.

sniklaus commented 4 years ago

You are correct indeed. I would not necessarily do the interpolation the way you outlined though. The predicted meta_flow may already be at the resolution of tensorFlow_L3, so interpolation may not be necessary. As for tensorFlow_L2, I would compute it from 2 * upsample(tensorFlow_L3) and likewise for tensorFlow_L1, I would compute it from 2 * upsample(tensorFlow_L2).
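
A minimal sketch of that coarse-to-fine scheme, assuming `meta_flow` is the raw PWC-Net output at 1/4 of the input resolution (the 20 / 4 = 5 factor converts it to pixel units at that resolution, following the convention in run.py):

```python
import torch
import torch.nn.functional as F

def up2(flow):
    # double the spatial resolution and the flow magnitudes together
    return 2.0 * F.interpolate(input=flow, scale_factor=2.0, mode='bilinear', align_corners=False)

meta_flow = torch.zeros(1, 2, 64, 64)  # hypothetical raw output for a 256x256 input

tensorFlow_L3 = 5.0 * meta_flow     # pixel units at 1/4 resolution, no interpolation needed
tensorFlow_L2 = up2(tensorFlow_L3)  # 1/2 resolution, equivalent to a factor of 20 / 2
tensorFlow_L1 = up2(tensorFlow_L2)  # full resolution, equivalent to a factor of 20
```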

laomao0 commented 4 years ago

> You are correct indeed. I would not necessarily do the interpolation the way you outlined though. The predicted meta_flow may already be at the resolution of tensorFlow_L3, so interpolation may not be necessary. As for tensorFlow_L2, I would compute it from 2 * upsample(tensorFlow_L3) and likewise for tensorFlow_L1, I would compute it from 2 * upsample(tensorFlow_L2).

Thanks for your reply, that helps me a lot.