issakh closed 4 years ago
Can you provide more details on the training? Do you train with perceptual loss from scratch? Also, are you encountering this issue when training the provided model?
I have tried both scenarios: L1 followed by perceptual, and perceptual from scratch. The loss is very high, around 1500, which seems odd. The output image quality is very good, but the brightness problem makes it of little use. Could it have to do with the implementation of the loss? I'm using this implementation: https://github.com/martkartasev/sepconv
Could you provide an example input and its prediction before and after fine-tuning with perceptual loss?
The first image should be the one which was done using L1 then perceptual and the second just L1
There definitely is a brightness issue. What are the details of the perceptual loss that you are using?
It is this one (from https://github.com/martkartasev/sepconv/blob/master/src/loss.py):
class VggLoss(nn.Module):
    def __init__(self):
        super(VggLoss, self).__init__()
        model = torchvision.models.vgg19(pretrained=True).cuda()
        self.features = nn.Sequential(
            # stop at relu4_4 (-10)
            *list(model.features.children())[:-10]
        )
        for param in self.features.parameters():
            param.requires_grad = False

    def forward(self, output, target):
        outputFeatures = self.features(output)
        targetFeatures = self.features(target)
        loss = torch.norm(outputFeatures - targetFeatures, 2)
        return config.VGG_FACTOR * loss
Looks reasonable. You said you are already making sure to normalize the mean and the standard deviation when using this loss - are you also making sure that you have the right color channel ordering?
The image is RGB, which should be the right input, the only thing that would differ to the original vgg network is the input image size
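For reference, the pretrained torchvision VGG models expect RGB inputs scaled to [0, 1] and normalized with the ImageNet mean and standard deviation. The following is a minimal NumPy sketch of that preprocessing, assuming an image array of shape (H, W, 3); the helper names `normalize_for_vgg` and `bgr_to_rgb` are hypothetical and not part of either repository.

```python
import numpy as np

# ImageNet statistics used by torchvision's pretrained models (RGB order).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize_for_vgg(image_rgb):
    """Hypothetical helper: normalize an (H, W, 3) RGB image in [0, 1]."""
    image_rgb = np.asarray(image_rgb, dtype=np.float32)
    return (image_rgb - IMAGENET_MEAN) / IMAGENET_STD


def bgr_to_rgb(image_bgr):
    """Reverse the channel axis, e.g. for images loaded with OpenCV."""
    return np.asarray(image_bgr)[..., ::-1]
```

If the frames are loaded in BGR order, swapping the channels before normalizing keeps the per-channel statistics aligned with the right colors.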
The loss should have a sum or a mean somewhere, there is no reduction in torch.norm(...) after all. Which one are you using?
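To illustrate why the choice of reduction matters: an L2 norm taken over all feature elements is not normalized by the number of elements, so its value grows with the size of the feature maps, whereas a mean-squared error does not. The sketch below uses NumPy and random arrays as stand-ins for VGG features (an assumption made purely for illustration); the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def l2_norm_loss(a, b):
    # Square root of the summed squared differences over all elements.
    return float(np.sqrt(np.sum((a - b) ** 2)))


def mse_loss(a, b):
    # Mean reduction: independent of how many elements the features have.
    return float(np.mean((a - b) ** 2))


for size in (16, 64):
    # Stand-ins for feature maps of shape (channels, height, width).
    out_feat = rng.standard_normal((512, size, size))
    tgt_feat = rng.standard_normal((512, size, size))
    print(size, l2_norm_loss(out_feat, tgt_feat), mse_loss(out_feat, tgt_feat))
```

With the norm, larger inputs yield larger loss values for the same per-element error, which may be one reason a loss magnitude in the thousands is not necessarily alarming by itself.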
I use the loss as attached as I thought it was complete, would something like this have to be implemented?
reg_loss = REGULARIZATION * (
    torch.sum(torch.abs(y[:, :, :, :-1] - y[:, :, :, 1:])) +
    torch.sum(torch.abs(y[:, :, :-1, :] - y[:, :, 1:, :]))
)
What you quoted is a regularization loss, which is different. I am afraid that I am unable to assist you on this further, there are too many unknowns without the entire source code. My apologies.
Hi, sorry to bother you. I think the issue I'm facing has to do with the backpropagation of the sepconv layer (I tried some other losses and a similar issue surfaced). I would appreciate it if you could take a look at the backward implementation I'm using: https://github.com/HyeongminLEE/pytorch-sepconv/blob/master/sepconv.py Really sorry for wasting your time. I have been trying for a long time to get things to work and to narrow down the cause of the problem, so it would be much appreciated if you could help.
The backwards implementation looks good to me, you could compare the numerical and analytical gradients to verify this further.
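As a rough sketch of what such a comparison can look like, the following NumPy example checks the gradients of a single output pixel of a separable convolution, y = sum_ij vertical[i] * horizontal[j] * patch[i, j], against central finite differences. All names are hypothetical and the kernel size of 51 is only illustrative. In PyTorch, `torch.autograd.gradcheck` serves the same purpose for custom autograd functions when given double-precision inputs.

```python
import numpy as np


def sepconv_pixel(patch, vertical, horizontal):
    # One output pixel: sum_ij vertical[i] * horizontal[j] * patch[i, j].
    return float(vertical @ patch @ horizontal)


def analytical_grads(patch, vertical, horizontal):
    # d y / d vertical[i]   = sum_j patch[i, j] * horizontal[j]
    # d y / d horizontal[j] = sum_i vertical[i] * patch[i, j]
    return patch @ horizontal, patch.T @ vertical


def numerical_grad(func, x, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (func(x + step) - func(x - step)) / (2.0 * eps)
    return grad


rng = np.random.default_rng(0)
patch = rng.standard_normal((51, 51))
vertical = rng.standard_normal(51)
horizontal = rng.standard_normal(51)

grad_v, grad_h = analytical_grads(patch, vertical, horizontal)
num_v = numerical_grad(lambda v: sepconv_pixel(patch, v, horizontal), vertical)
num_h = numerical_grad(lambda h: sepconv_pixel(patch, vertical, h), horizontal)
print(np.abs(grad_v - num_v).max(), np.abs(grad_h - num_h).max())
```

If the two maximum differences are not close to zero, the analytical backward pass is the likely culprit.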
Hi, thanks for your help, I have managed to get it working. Do you by any chance have a link on how to calculate the statistical and error values for the Middlebury set?
Good news, congrats! There is a benchmark.py, does that have everything you need?
Hi, what I'm looking for is the method to calculate these for the Middlebury set: endpoint, angle, interpolation, and normalized interpolation errors, plus the statistics for the endpoint error: Average, SD, R0.5, R1.0, R2.0, A50, A75, A95.
Please see: http://vision.middlebury.edu/flow/submit/
I have taken a look at this and the code on the page. It is for saving the flow vectors, which this sepconv algorithm doesn't compute. The other option available was to attach the interpolated frames (though it is not clear whether the flow files are required as well). Did you go with the second option when submitting the evaluation results?
Yes, and most if not all interpolation papers have "Interpolation results only." stated in their description at the results page: http://vision.middlebury.edu/flow/eval/results/results-i1.php
I am not sure whether you really want to submit your results there though, make sure to read: "We will only be able to evaluate original, "final" results ready to be submitted to a conference or journal."
Thanks for your response. I wanted to know if I could replicate the results in some way; I guess PSNR and SSIM are the closest I can get. I've been trying to familiarize myself with your paper and work, as I am currently modifying it and trying to improve performance. Thanks again.
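For completeness, PSNR can be computed in a few lines. This is a minimal NumPy sketch assuming both images are arrays of the same shape scaled to [0, max_value]; the function name `psnr` is hypothetical. For SSIM, an existing implementation such as `skimage.metrics.structural_similarity` from scikit-image is probably the easier route.

```python
import numpy as np


def psnr(prediction, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_value]."""
    prediction = np.asarray(prediction, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    mse = np.mean((prediction - reference) ** 2)
    if mse == 0.0:
        # Identical images: the ratio is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Note that reported numbers depend on details such as the value range and whether the metric is averaged per image or over the whole set, so those choices should match the ones used by the results being compared against.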
Hi, in your paper you noted that your method needs 1.27GB of memory to interpolate a 1080p frame. What methodology did you use for calculating this? Thanks
I checked nvidia-smi while running it on 1080p footage. This isn't very accurate though and only gives you a high-water mark. Nowadays, you can use torch.cuda.memory_summary() for more detailed metrics, but that didn't exist back when I ran the experiments. Note that this approach of measuring the memory usage also depends on the underlying convolution implementation. I might have also set torch.backends.cudnn.enabled = False to measure the memory usage with PyTorch's own convolution implementation.
Thanks for that, I've just got one more question, how did you manage to get the estimated kernels in figure 6?
I dumped all kernel coefficients and then inspected the per-pixel kernels using a GUI tool that I wrote. Of course, one needs to convolve the separable filter components with each other to get a regular 2D kernel.
How are you dumping the kernel coefficients, I've been doing this for each relevant layer I want to look at:
kernels = model2.moduleUpsample2[1].weight.detach().cpu()
and then plot using matplotlib. However, output seems to be suboptimal. Do you have any suggestions?
I am afraid that would print the learned weights of that layer you selected, not the predicted per-pixel separable kernel weights. Instead, what you need to dump is the output of:
self.netVertical1(tenCombine)
self.netHorizontal1(tenCombine)
self.netVertical2(tenCombine)
self.netHorizontal2(tenCombine)
Thanks for the response, this is the dump of the weights of the final layer of self.netHorizontal1(tenCombine). I can't seem to dump the weights of the entire sequential object. Sorry if it's taking me long to figure this out, I just want to do things properly
I am afraid that you are still dumping the weights of the layer, not its prediction when given a specific input.
How would I be able to get the prediction? I can only get activation maps or kernel representations. The first two are the input images, the last is the activation.
Try changing: https://github.com/sniklaus/sepconv-slomo/blob/5696b52db9a60dd030f2d3f82604f48a88f4a258/run.py#L127-L128
To this:
tenVertical1 = self.netVertical1(tenCombine)
tenHorizontal1 = self.netHorizontal1(tenCombine)
tenVertical2 = self.netVertical2(tenCombine)
tenHorizontal2 = self.netHorizontal2(tenCombine)
torch.save(tenVertical1, 'vertical-1.pt')
torch.save(tenHorizontal1, 'horizontal-1.pt')
torch.save(tenVertical2, 'vertical-2.pt')
torch.save(tenHorizontal2, 'horizontal-2.pt')
tenDot1 = sepconv.FunctionSepconv(tenInput=tenFirst, tenVertical=tenVertical1, tenHorizontal=tenHorizontal1)
tenDot2 = sepconv.FunctionSepconv(tenInput=tenSecond, tenVertical=tenVertical2, tenHorizontal=tenHorizontal2)
Thanks for your help, I have done this now, do you know of any good software to inspect the .pt files?
You can then write a script that loads the pt files again and visualizes the filter kernels that were used to synthesize an output pixel at an arbitrary (x, y).
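Such a script could look roughly like the sketch below. It assumes that each saved tensor has the shape (1, K, H, W), with K one-dimensional kernel coefficients per output pixel, and it uses random data in place of the saved files so that it runs on its own; the function name `kernel_at` is hypothetical.

```python
import numpy as np


def kernel_at(vertical, horizontal, x, y):
    """Return the 2D kernel used for output pixel (x, y).

    Assumes arrays of shape (1, K, H, W), i.e. K coefficients per pixel.
    """
    v = np.asarray(vertical)[0, :, y, x]    # K vertical coefficients
    h = np.asarray(horizontal)[0, :, y, x]  # K horizontal coefficients
    # The separable pair expands to a K x K kernel via an outer product.
    return np.outer(v, h)


# With the saved files, the arrays could be obtained along these lines:
#   vertical = torch.load('vertical-1.pt').detach().cpu().numpy()
#   horizontal = torch.load('horizontal-1.pt').detach().cpu().numpy()
# Random data stands in for them here.
rng = np.random.default_rng(0)
vertical = rng.random((1, 51, 8, 8))
horizontal = rng.random((1, 51, 8, 8))

kernel = kernel_at(vertical, horizontal, x=3, y=5)
print(kernel.shape)
# The kernel could then be displayed with, e.g., matplotlib's imshow(kernel).
```

The outer product is what turns the two 1D components into the regular 2D kernel; plotting either component on its own will not resemble the kernels shown in the paper.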
Hi, I have a question about your WACV paper, you say " Specifically, we added residual blocks to the skip connections that join the two halves of the U-Net". Can you describe how you implemented this as I fail to understand how this has been done? Thanks
In the original architecture, the output from a block in the encoder is also fed to the respective block in the decoder, as shown here: https://github.com/sniklaus/sepconv-slomo/blob/46041adec601a4051b86741664bb2cdc80fe4919/run.py#L119
In the new architecture, among other changes, we transform the output from each block of the encoder before it is fed to the corresponding block in the decoder. So the above would look more like the following.
tenDeconv4 = self.netUpsample4(self.netDeconv4(tenDeconv5 + resnet_block_5(tenConv5)))
Thanks for your response, that clarifies things a lot. For the resnet block you use, what does it consist of 1x1 conv + batchnorm or 2x(3x3 conv + batchnorm) or neither?
It is prelu + conv + prelu + conv without batchnorm (there is no batchnorm in any part of the old/new SepConv) and with the prelu initialized using 0.25 for the slope. I am hoping to release the code for SepConv++ but haven't gotten around to going through the approval process.
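A block along those lines might be sketched in PyTorch as follows. Only the prelu + conv + prelu + conv ordering, the absence of batchnorm, and the 0.25 slope come from the description above; the 3x3 kernel size, the per-channel PReLU parameters, the identity skip, and the channel count are assumptions made for illustration, and the class name is hypothetical.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Hypothetical sketch: prelu + conv + prelu + conv with an identity skip."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.PReLU(num_parameters=channels, init=0.25),
            # Assumed 3x3 convolutions that preserve the spatial size.
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.PReLU(num_parameters=channels, init=0.25),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # Residual connection: add the block input to the transformed features.
        return x + self.body(x)


block = ResidualBlock(channels=64)
features = torch.randn(1, 64, 32, 32)
print(block(features).shape)
```

Because the block preserves the shape of its input, it can be placed on a skip connection without changing the layers on either side of it.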
Thanks for that I will try it out
Hi, sorry to bother you. When you were submitting the Middlebury evaluation results, how did you compute the time it took to run the Urban dataset? It seems that there is no standardised way to measure this (leading to minor discrepancies depending on what was used), so it would be appreciated if you could help me with this. Thanks
I called the network 1000 times, measured the time it took for each iteration (make sure to use torch.cuda.synchronize() to get valid timings), and reported the rounded median time. I wouldn't put too much weight on this metric on the Middlebury benchmark though, since different methods use different hardware. To get comparable runtime measurements, you will have to redo the measurements for the different methods yourself using the same hardware. Otherwise I can just use faster hardware and claim to be faster than all other methods.
Great thanks for that! So should I do something like this:
timeValue = []
for i in range(1000):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    output = run.estimate(frame1, frame2)
    end.record()
    # Waits for everything to finish running
    torch.cuda.synchronize()
    timeValue.append(start.elapsed_time(end))
numpy.mean(timeValue)
I really do appreciate your help and sorry for the excessive questions!
I have actually never used torch.cuda.Event, I just do something like the following.
runtimes = []
for i in range(1000):
    before = time.time()
    run.estimate(frame1, frame2)
    torch.cuda.synchronize()
    after = time.time()
    runtimes.append(after - before)
print(numpy.median(runtimes))
Thanks a lot for this!
Hi, in your paper you mentioned the use of a perceptual loss. Did you apply any pre-processing to the training set? I'm getting a contrast issue despite normalising with the ImageNet mean and standard deviation values. I've also tried it without the normalisation, but the output is more or less the same. I'd appreciate it if you had any ideas as to where I'm going wrong in this case. Thanks