sniklaus / sepconv-slomo

an implementation of Video Frame Interpolation via Adaptive Separable Convolution using PyTorch

Perceptual loss #46

Closed issakh closed 4 years ago

issakh commented 4 years ago

Hi, in your paper you mentioned the use of perceptual loss. Did you apply any pre-processing to the training set? I'm getting a contrast issue despite normalising with the ImageNet mean and standard deviation values. I've tried it without doing this, but the output is more or less the same. I'd appreciate any ideas as to where I'm going wrong in this case. Thanks

sniklaus commented 4 years ago

Can you provide more details on the training? Do you train with perceptual loss from scratch? Also, are you encountering this issue when training the provided model?

issakh commented 4 years ago

I have tried it in both scenarios: L1 then perceptual, and training with perceptual from scratch. The loss is very high, around 1500, which seems weird. In terms of output image quality it is very good, but with the brightness problem it isn't of much use. Could it have to do with the implementation of the loss? I'm using this implementation: https://github.com/martkartasev/sepconv

sniklaus commented 4 years ago

Could you provide an example input and its prediction before and after fine-tuning with perceptual loss?

issakh commented 4 years ago

[attached images: percp output]

The first image should be the one produced using L1 then perceptual, and the second just L1.

sniklaus commented 4 years ago

There definitely is a brightness issue. What are the details of the perceptual loss that you are using?

issakh commented 4 years ago

It is this one (from https://github.com/martkartasev/sepconv/blob/master/src/loss.py):

import torch
import torchvision
from torch import nn


class VggLoss(nn.Module):
    def __init__(self):
        super(VggLoss, self).__init__()

        model = torchvision.models.vgg19(pretrained=True).cuda()

        self.features = nn.Sequential(
            # stop at relu4_4 (-10)
            *list(model.features.children())[:-10]
        )

        for param in self.features.parameters():
            param.requires_grad = False

    def forward(self, output, target):
        outputFeatures = self.features(output)
        targetFeatures = self.features(target)

        # L2 norm of the difference between the VGG feature maps
        loss = torch.norm(outputFeatures - targetFeatures, 2)

        # VGG_FACTOR is a scalar weight defined in that repository's config module
        return config.VGG_FACTOR * loss
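(Aside: the usual convention when feeding images to torchvision's VGG is to normalize with the ImageNet statistics before self.features is applied; a minimal sketch, assuming the frames are RGB tensors in [0, 1], with normalize_for_vgg being a hypothetical helper name:)

import torch

# ImageNet statistics commonly used to normalize inputs to torchvision's VGG models
imagenetMean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
imagenetStd = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def normalize_for_vgg(tenImage):
    # tenImage: RGB tensor in [0, 1] with shape [batch, 3, height, width]
    return (tenImage - imagenetMean.to(tenImage.device)) / imagenetStd.to(tenImage.device)
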
sniklaus commented 4 years ago

Looks reasonable. You said you are already making sure to normalize the mean and the standard deviation when using this loss - are you also making sure that you have the right color channel ordering?

issakh commented 4 years ago

The image is RGB, which should be the right input. The only thing that differs from the original VGG network is the input image size.

sniklaus commented 4 years ago

The loss should have a sum or a mean somewhere, there is no reduction in torch.norm(...) after all. Which one are you using?

issakh commented 4 years ago

I use the loss as attached, as I thought it was complete. Would something like this have to be implemented?

# total-variation-style smoothness term: penalizes differences between neighboring pixels
reg_loss = REGULARIZATION * (
    torch.sum(torch.abs(y[:, :, :, :-1] - y[:, :, :, 1:])) +
    torch.sum(torch.abs(y[:, :, :-1, :] - y[:, :, 1:, :]))
)
sniklaus commented 4 years ago

What you quoted is a regularization loss, which is different. I am afraid that I am unable to assist you on this further, there are too many unknowns without the entire source code. My apologies.

issakh commented 4 years ago

Hi, sorry to bother you. I think the issue I'm facing has to do with backpropagation of the sepconv layer (I tried some other losses and a similar issue surfaced). I would appreciate it if you could take a look at this backwards implementation I'm using: https://github.com/HyeongminLEE/pytorch-sepconv/blob/master/sepconv.py Really sorry for taking up your time; I have been trying for a long time to get things to work and to narrow down the cause of the problem, so any help would be much appreciated.

sniklaus commented 4 years ago

The backwards implementation looks good to me, you could compare the numerical and analytical gradients to verify this further.
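(Aside: a minimal sketch of such a numerical-vs-analytical gradient check with torch.autograd.gradcheck; it assumes a FunctionSepconv entry point like the one used further down in this thread, that the custom CUDA kernels accept double precision (they may not), and the tensor shapes are arbitrary small placeholders:)

import torch
import sepconv  # the custom separable convolution module

# gradcheck needs double precision and small tensors to stay tractable;
# for a kernel size of 5, a 9x9 input yields a 5x5 output
tenInput = torch.randn(1, 3, 9, 9, dtype=torch.float64, device='cuda', requires_grad=True)
tenVertical = torch.randn(1, 5, 5, 5, dtype=torch.float64, device='cuda', requires_grad=True)
tenHorizontal = torch.randn(1, 5, 5, 5, dtype=torch.float64, device='cuda', requires_grad=True)

print(torch.autograd.gradcheck(
    lambda tenIn, tenVer, tenHor: sepconv.FunctionSepconv(tenInput=tenIn, tenVertical=tenVer, tenHorizontal=tenHor),
    (tenInput, tenVertical, tenHorizontal),
    eps=1e-6, atol=1e-4
))
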

issakh commented 4 years ago

Hi, thanks for your help, I have managed to get it working. Do you by any chance have a link on how to calculate the statistical and error values for the Middlebury set?

sniklaus commented 4 years ago

Good news, congrats! There is a benchmark.py, does that have everything you need?

issakh commented 4 years ago

Hi, what I'm looking for is the method to calculate these for the Middlebury set: endpoint, angle, interpolation, and normalized interpolation errors, plus the statistics for the endpoint error: Average, SD, R0.5, R1.0, R2.0, A50, A75, A95.

sniklaus commented 4 years ago

Please see: http://vision.middlebury.edu/flow/submit/

issakh commented 4 years ago

I have taken a look at this and at the code on that page; it is for saving the flow vectors, which this sepconv algorithm doesn't compute. The other option available was to attach the interpolated frames (though it is not clear whether the flow files are required as well). Did you go with the second option when submitting your evaluation results?

sniklaus commented 4 years ago

Yes, and most if not all interpolation papers have "Interpolation results only." stated in their description at the results page: http://vision.middlebury.edu/flow/eval/results/results-i1.php

I am not sure whether you really want to submit your results there though, make sure to read: "We will only be able to evaluate original, "final" results ready to be submitted to a conference or journal."

issakh commented 4 years ago

Thanks for your response. I wanted to know if I could replicate the results in some way; I guess PSNR and SSIM are the closest I can do. I've been trying to familiarize myself with your paper and your work, as I am currently modifying it and trying to improve performance. Thanks again
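(Aside: a minimal PSNR/SSIM sketch using imageio and scikit-image; the file names are placeholders, and older scikit-image versions take multichannel=True instead of channel_axis:)

import imageio
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# placeholder file names for an interpolated frame and its ground truth
prediction = imageio.imread('prediction.png')
groundtruth = imageio.imread('groundtruth.png')

print('PSNR:', peak_signal_noise_ratio(groundtruth, prediction, data_range=255))
print('SSIM:', structural_similarity(groundtruth, prediction, data_range=255, channel_axis=2))
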

issakh commented 4 years ago

Hi, in your paper you noted that your method needs 1.27GB of memory to interpolate a 1080p frame. What methodology did you use for calculating this? Thanks

sniklaus commented 4 years ago

I checked nvidia-smi while running it on 1080p footage. This isn't very accurate though and only gives you a high-water mark. Nowadays, you can use torch.cuda.memory_summary() for more detailed metrics, but that didn't exist back when I ran the experiments. Note that this approach of measuring the memory usage also depends on the underlying convolution implementation. I might have additionally set torch.backends.cudnn.enabled = False to measure the memory usage with PyTorch's own convolution implementation.
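(Aside: these days the high-water mark can also be read from within PyTorch itself; a minimal sketch, with frame1 and frame2 standing in for two 1080p input tensors as used elsewhere in this thread:)

import torch
import run  # this repository's run.py, which exposes estimate(...)

torch.cuda.reset_peak_memory_stats()

output = run.estimate(frame1, frame2)  # frame1/frame2: placeholder 1080p input tensors
torch.cuda.synchronize()

print(torch.cuda.max_memory_allocated() / 1024 ** 3, 'GB peak allocation')
print(torch.cuda.memory_summary())
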

issakh commented 4 years ago

Thanks for that, I've just got one more question, how did you manage to get the estimated kernels in figure 6?

sniklaus commented 4 years ago

I dumped all kernel coefficients and then inspected the per-pixel kernels using a GUI tool that I wrote. Of course, one needs to convolve the separable filter components with each other to get a regular 2D kernel.

issakh commented 4 years ago

How are you dumping the kernel coefficients? For each relevant layer I want to look at, I've been doing kernels = model2.moduleUpsample2[1].weight.detach().cpu() and then plotting with matplotlib. However, the output seems to be suboptimal. Do you have any suggestions?

sniklaus commented 4 years ago

I am afraid that would give you the learned weights of the layer you selected, not the predicted per-pixel separable kernel coefficients. Instead, what you need to dump is the output of the kernel-prediction subnetworks themselves (netVertical1/netHorizontal1 and netVertical2/netHorizontal2 in run.py).

issakh commented 4 years ago

Thanks for the response, this is the dump of the weights of the final layer of self.netHorizontal1(tenCombine). I can't seem to dump the weights of the entire sequential object. Sorry if it's taking me long to figure this out, I just want to do things properly

[attached image: mygraph]

sniklaus commented 4 years ago

I am afraid that you are still dumping the weights of the layer, not its prediction when given a specific input.

issakh commented 4 years ago

How would I be able to get the prediction? I can only get activation maps or kernel representations. The first two attached images are the input frames, the last is the activation. [attached images: in1, in2, mygraph]

sniklaus commented 4 years ago

Try changing: https://github.com/sniklaus/sepconv-slomo/blob/5696b52db9a60dd030f2d3f82604f48a88f4a258/run.py#L127-L128

To this:

tenVertical1 = self.netVertical1(tenCombine)
tenHorizontal1 = self.netHorizontal1(tenCombine)
tenVertical2 = self.netVertical2(tenCombine)
tenHorizontal2 = self.netHorizontal2(tenCombine)

torch.save(tenVertical1, 'vertical-1.pt')
torch.save(tenHorizontal1, 'horizontal-1.pt')
torch.save(tenVertical2, 'vertical-2.pt')
torch.save(tenHorizontal2, 'horizontal-2.pt')

tenDot1 = sepconv.FunctionSepconv(tenInput=tenFirst, tenVertical=tenVertical1, tenHorizontal=tenHorizontal1)
tenDot2 = sepconv.FunctionSepconv(tenInput=tenSecond, tenVertical=tenVertical2, tenHorizontal=tenHorizontal2)
issakh commented 4 years ago

Thanks for your help, I have done this now, do you know of any good software to inspect the .pt files?

sniklaus commented 4 years ago

You can then write a script that loads the pt files again and visualizes the filter kernels that were used to synthesize an output pixel at an arbitrary (x, y).
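(Aside: a minimal sketch of such a script; it assumes the saved tensors have shape [1, 51, height, width] as with the 51-pixel kernels from the paper, and forms the full 2D kernel at a chosen pixel via the outer product of the vertical and horizontal components:)

import torch
import matplotlib.pyplot as plt

tenVertical = torch.load('vertical-1.pt', map_location='cpu')      # [1, 51, height, width]
tenHorizontal = torch.load('horizontal-1.pt', map_location='cpu')  # [1, 51, height, width]

x, y = 320, 180  # arbitrary output pixel to inspect

# the separable components combine into a regular 2D kernel via an outer product
kernel = torch.outer(tenVertical[0, :, y, x], tenHorizontal[0, :, y, x])

plt.imshow(kernel.detach().numpy(), cmap='viridis')
plt.colorbar()
plt.title('per-pixel kernel at ({}, {})'.format(x, y))
plt.show()
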

issakh commented 3 years ago

Hi, I have a question about your WACV paper, you say " Specifically, we added residual blocks to the skip connections that join the two halves of the U-Net". Can you describe how you implemented this as I fail to understand how this has been done? Thanks

sniklaus commented 3 years ago

In the original architecture, the output from a block in the encoder is also fed to the corresponding block in the decoder, as in: https://github.com/sniklaus/sepconv-slomo/blob/46041adec601a4051b86741664bb2cdc80fe4919/run.py#L119

In the new architecture, among other changes, we transform the output from each block of the encoder before it is fed to the corresponding block in the decoder. So the above would look more like the following.

tenDeconv4 = self.netUpsample4(self.netDeconv4(tenDeconv5 + resnet_block_5(tenConv5)))
issakh commented 3 years ago

Thanks for your response, that clarifies things a lot. For the resnet block you use, what does it consist of: 1x1 conv + batchnorm, 2x(3x3 conv + batchnorm), or neither?

sniklaus commented 3 years ago

It is prelu + conv + prelu + conv without batchnorm (there is no batchnorm in any part of the old/new SepConv) and with the prelu initialized using 0.25 for the slope. I am hoping to release the code for SepConv++ but haven't gotten around to going through the approval process.
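(Aside: one reading of that description as a minimal sketch, not the released SepConv++ code; the 3x3 kernel size, the per-channel PReLU, and the identity shortcut inside the block are assumptions:)

import torch

class ResBlock(torch.nn.Module):
    def __init__(self, channels):
        super().__init__()

        # prelu + conv + prelu + conv, no batchnorm, prelu slope initialized to 0.25
        self.netMain = torch.nn.Sequential(
            torch.nn.PReLU(num_parameters=channels, init=0.25),
            torch.nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            torch.nn.PReLU(num_parameters=channels, init=0.25),
            torch.nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        )

    def forward(self, tenInput):
        # identity shortcut around the transform, as in a standard residual block
        return tenInput + self.netMain(tenInput)
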

issakh commented 3 years ago

Thanks for that, I will try it out

issakh commented 3 years ago

Hi, sorry to bother you. When you were submitting the Middlebury evaluation results, how did you compute the time it took to run the Urban dataset? There doesn't seem to be any standardised way to measure this (leading to minor discrepancies depending on what was used), so it would be appreciated if you could help me with this. Thanks

sniklaus commented 3 years ago

I called the network 1000 times, measured the time it took for each iteration (make sure to use torch.cuda.synchronize() to get valid timings), and reported the rounded median time. I wouldn't put too much into this metric on the Middlebury benchmark though since different methods use different hardware. To get comparable runtime measurements, you will have to redo the measurements for different methods yourself using the same hardware. Otherwise I can just use faster hardware and claim to be faster than all other methods.

issakh commented 3 years ago

Great thanks for that! So should I do something like this:

timeValue = []

for i in range(1000):
  start = torch.cuda.Event(enable_timing=True)
  end = torch.cuda.Event(enable_timing=True)

  start.record()
  output = run.estimate(frame1, frame2)
  end.record()

  # Waits for everything to finish running
  torch.cuda.synchronize()

  timeValue.append(start.elapsed_time(end))

numpy.mean(timeValue)

I really do appreciate your help and sorry for the excessive questions!

sniklaus commented 3 years ago

I have actually never used torch.cuda.Event, I just do something like the following.

runtimes = []
for i in range(1000):
    before = time.time()
    run.estimate(frame1, frame2)
    torch.cuda.synchronize()
    after = time.time()
    runtimes.append(after - before)
print(numpy.median(runtimes))
issakh commented 3 years ago

Thanks a lot for this!