andrewliao11 opened this issue 6 years ago
@andrewliao11 The file you mentioned is generated by make_metadata.lua: https://github.com/silverbottlep/tvsn/blob/master/tvsn/code/make_metadata.lua
But I do not know how test_set_car.t7 and test_set_chair.t7 are generated.
I am also trying to re-implement TVSN. My L1 loss on the chair category is 0.28, higher than the 0.24 reported in the paper. I implemented it in Python/PyTorch, and everything is exactly the same; I am still trying to figure out the discrepancy. Any advice? Feel free to discuss it with me.
Hi @liangzu , I'm also starting to implement the same in TensorFlow (I have an implementation of the DOAFN network that does the appearance-flow part, and I'll build on top of that to add the generator and discriminator). It would be great if you could share the PyTorch code as a reference, because referring to that when in doubt will be a lot easier than referring to the Lua/Torch code, which I'm not familiar with at all. Thanks!
For instance, one doubt I have: when doing deconvolution (transposed convolution) as in the decoder part of the network, isn't it usual to initialize the convolutional kernel with the weights of a bilinear upsampling operation? But Torch's SpatialFullConvolution() used here doesn't seem to do that under the hood. So I just wanted to ask whether the reason is that we want to learn the weights entirely from scratch without the bilinear init, since in this case feature maps from the encoder are also used via skip connections. @liangzu @silverbottlep
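For reference, this is the kind of bilinear initialization I mean; a minimal PyTorch sketch of the general trick (not taken from the tvsn code, and the same idea carries over to TensorFlow):

```python
import torch
import torch.nn as nn

def bilinear_kernel(in_channels, out_channels, kernel_size):
    """Build an (in_channels, out_channels, k, k) weight tensor whose
    per-channel kernels perform plain bilinear upsampling."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = torch.arange(kernel_size, dtype=torch.float32)
    filt = 1 - torch.abs(og - center) / factor
    kernel_2d = filt[:, None] * filt[None, :]
    weight = torch.zeros(in_channels, out_channels, kernel_size, kernel_size)
    for i in range(min(in_channels, out_channels)):
        weight[i, i] = kernel_2d
    return weight

# e.g. a 2x-upsampling deconv whose weights start out as bilinear interpolation
deconv = nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2, padding=1, bias=False)
with torch.no_grad():
    deconv.weight.copy_(bilinear_kernel(64, 64, 4))
```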
Hi @gunshi , I didn't think about it as deeply as you did. I simply translated the code to PyTorch and then checked whether my implementation was correct. Weight initialization is done in https://github.com/silverbottlep/tvsn/blob/49820d819a3d3588e7c4ff3c3c9c4698a5593c53/tvsn/code/train_doafn.lua#L102 and of course in train_tvsn.lua. Since deep networks and their architectural choices are quite empirical, I rarely ask why: I follow the practice of others, compare my implementation against theirs, and verify any doubts experimentally.
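For what it's worth, in PyTorch I do the equivalent with `Module.apply`; a rough sketch, where the zero-mean Gaussian and the std of 0.01 are my assumption and should be checked against the constants actually used in train_doafn.lua:

```python
import torch.nn as nn

def weights_init(m):
    # Assumed distribution: zero-mean Gaussian for conv/linear weights, zero bias.
    # The exact std should be checked against train_doafn.lua; 0.01 is a guess.
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d, nn.Linear)):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# usage: model.apply(weights_init) recursively visits every submodule
```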
Below is my code, the AFN version (PyTorch has built-in support for a differentiable sampling layer). Another version of my code is the direct version, i.e. it generates the novel view directly. My problems are: 1) the masked L1 loss of the direct version is higher (0.28 on the chair category) than reported (0.24), while the training and test loss is about 0.06, after 600,000 training iterations; 2) training the AFN (code below) is difficult: the network learns nothing and gives me output images that are entirely white (background). My initialization might be slightly different from DOAFN_SYM_256.lua; everything else is exactly the same (hopefully, see the code below). What am I doing wrong? Do you have any suggestions or tricks for training the AFN? Thanks.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

import net_utils  # my own helper module providing Flatten


class ImageEncoder(nn.Module):
    def __init__(self, feature_size):
        super(ImageEncoder, self).__init__()
        self.img_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            net_utils.Flatten(), nn.Linear(4 * 4 * 512, feature_size), nn.ReLU()
        )

    def forward(self, img):
        return self.img_encoder(img)


class NovelViewDecoderAFN(nn.Module):
    def __init__(self, feature_size):
        super(NovelViewDecoderAFN, self).__init__()
        self.theta_encoder = nn.Sequential(
            nn.Linear(17, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU()
        )
        self.refiner = nn.Sequential(
            nn.Linear(256 + feature_size, feature_size), nn.ReLU(),
            nn.Linear(feature_size, feature_size), nn.ReLU(),
            nn.Linear(feature_size, 4 * 4 * 512), nn.ReLU()
        )
        self.flow_decoder = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.ConvTranspose2d(512, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.ConvTranspose2d(256, 128, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.ConvTranspose2d(128, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.ConvTranspose2d(64, 32, kernel_size=5, stride=1, padding=2), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.ConvTranspose2d(32, 16, kernel_size=5, stride=1, padding=2), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.ConvTranspose2d(16, 2, kernel_size=5, stride=1, padding=2), nn.Tanh()
        )

    def forward(self, img, img_feature, theta):
        theta = self.theta_encoder(theta)
        combined_feature = torch.cat((img_feature, theta), 1)
        refined_feature = self.refiner(combined_feature).view(-1, 512, 4, 4)
        flow = self.flow_decoder(refined_feature)
        flow = flow.permute(0, 2, 3, 1)  # (N, H, W, 2) sampling grid for grid_sample
        novel_view = F.grid_sample(img, flow)
        return novel_view


class AFN(nn.Module):
    def __init__(self, feature_size):
        super(AFN, self).__init__()
        self.img_encoder = ImageEncoder(feature_size)
        self.novel_view_decoder = NovelViewDecoderAFN(feature_size)

    def forward(self, img, theta):
        img_feature = self.img_encoder(img)
        novel_view = self.novel_view_decoder(img, img_feature, theta)
        return novel_view
```
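For completeness, this is roughly how I drive the model during training. The masked L1 loss is my own reading of the paper (only foreground pixels of the target view contribute), and the feature size and learning rate are just the values I happened to use, so treat this as a sketch rather than the reference training loop:

```python
import torch

model = AFN(feature_size=2048)  # 2048 is an example value
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# img:    (N, 3, 256, 256) source view
# theta:  (N, 17) viewpoint-transformation encoding
# target: (N, 3, 256, 256) ground-truth novel view
# mask:   (N, 1, 256, 256) foreground mask of the target view
def train_step(img, theta, target, mask):
    optimizer.zero_grad()
    pred = model(img, theta)
    # masked L1: average the absolute error over foreground pixels only
    loss = torch.sum(torch.abs(pred - target) * mask) / (mask.sum() * 3 + 1e-8)
    loss.backward()
    optimizer.step()
    return loss.item()
```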
Hi, I'm not very familiar with PyTorch, but one thing I think you should look into is the upsampling. I see that you're first upsampling and then doing a transposed convolution? The aim is to do a learnable deconvolution (transposed convolution), so ConvTranspose2d should be enough on its own, without a separate upsampling step, since given the right parameters it will do the upsampling itself. (In context, my question was about whether to initialise the kernel with weights that look like bilinear upsampling weights when doing something like ConvTranspose2d, since that is usually considered a good starting point for the weights that will eventually be learnt.)
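Concretely, what I mean is that each Upsample + stride-1 deconv pair could in principle collapse into a single stride-2 ConvTranspose2d; a minimal sketch (the kernel size and padding choices here are mine, not from the tvsn code):

```python
import torch.nn as nn

# one decoder stage: the stride-2 transposed conv does the 2x upsampling itself,
# so no separate nn.Upsample is needed (kernel_size=4, padding=1 keeps sizes exact)
up_block = nn.Sequential(
    nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1),
    nn.ReLU(),
)
# e.g. a 4x4x512 feature map becomes 8x8x256
```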
For reference, take a look at: https://github.com/meetshah1995/pytorch-semseg/blob/772c771eecfc106f4c121bdc1c62061d1c2d8dc9/ptsemseg/models/utils.py (the deconv util method calls ConvTranspose2d directly with appropriate filter sizes, kernel sizes, etc.). This repo implements several semantic segmentation networks, in which a deconv operation is very common at the decoder end.
Another suggestion: the network I originally implemented was the appearance flow network from https://arxiv.org/abs/1605.03557, which I tested on KITTI sequences. That is the original appearance flow network paper, and the network used in the tvsn repo is very similar to the one they used. They were using the Deconvolution layer in Caffe, which under the hood initialises weights with bilinear upsampling coefficients. One quick thing you could try (only to test this out) is to not use deconv (ConvTranspose2d) at all, and use Upsample in 'bilinear' mode so that it purely bilinearly upsamples; see if that at least removes your white-image problem. If that works to some extent, remove the upsampling and try to make it work with ConvTranspose2d.
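For example, something along these lines for one decoder stage (just my interpretation of the test: fixed bilinear upsampling does the resizing, and a plain conv only adapts the channel count):

```python
import torch.nn as nn

# diagnostic variant of one decoder stage: no learned deconvolution at all,
# just fixed bilinear upsampling plus a regular conv for the channel change
debug_block = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
    nn.Conv2d(512, 256, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
)
```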
Thank you for your detailed and actionable suggestions! I will definitely try them! It seems that my network is being trained with an improper initialization. A nearest-neighbor upsampling layer followed by a stride-1 convtranspose2d is the practice in https://github.com/silverbottlep/tvsn/blob/49820d819a3d3588e7c4ff3c3c9c4698a5593c53/tvsn/code/models/DOAFN_SYM_256.lua#L52 (notice that the stride is 1).
Here is my SSIM implementation, in case you need it. It is tested and gives the same results as image_error_measures.lua, but use it with care.
```python
import numpy as np
from scipy.signal import convolve


# reference: https://github.com/mubeta06/python/blob/master/signal_processing/sp/gauss.py
def fspecial_gauss(size, sigma):
    """Mimic the 'fspecial' gaussian MATLAB function."""
    x, y = np.mgrid[-size//2 + 1:size//2 + 1, -size//2 + 1:size//2 + 1]
    g = np.exp(-((x**2 + y**2)/(2.0*sigma**2)))
    return g/g.sum()


# This is a naive (no error checking) python implementation of SSIM,
# see https://github.com/silverbottlep/tvsn/blob/master/tvsn/code/image_error_measures.lua
# Input: numpy arrays of shape (256, 256, 3) with values in [0, 1]
def SSIM(img1, img2):
    gaussian_window = fspecial_gauss(11, 1.5)
    K1 = 0.01
    K2 = 0.03
    L = 255
    C1 = (K1 * L) * (K1 * L)
    C2 = (K2 * L) * (K2 * L)

    # rgb2gray, see https://github.com/torch/image/blob/5aa18819b6a7b44751f8a858bd232d1c07b67985/generic/image.c#L2105
    img1 = 0.299 * img1[:, :, 0] + 0.5870 * img1[:, :, 1] + 0.1140 * img1[:, :, 2]
    img2 = 0.299 * img2[:, :, 0] + 0.5870 * img2[:, :, 1] + 0.1140 * img2[:, :, 2]

    # scale from [0, 1] to [0, 255]
    img1 = img1 * 255
    img2 = img2 * 255

    mu1 = convolve(img1, gaussian_window, 'full')
    mu2 = convolve(img2, gaussian_window, 'full')
    mu1_sq = mu1 * mu1
    mu2_sq = mu2 * mu2
    mu1_mu2 = mu1 * mu2
    sigma1_sq = convolve(img1 * img1, gaussian_window, 'full') - mu1_sq
    sigma2_sq = convolve(img2 * img2, gaussian_window, 'full') - mu2_sq
    sigma12 = convolve(img1 * img2, gaussian_window, 'full') - mu1_mu2

    ssim_map = ((mu1_mu2*2 + C1)*(sigma12*2 + C2)) / ((mu1_sq + mu2_sq + C1)*(sigma1_sq + sigma2_sq + C2))
    return np.mean(ssim_map)
```
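Usage is just the following (assuming two HxWx3 float arrays already scaled to [0, 1]; the random inputs are only there to show the calling convention):

```python
import numpy as np

# two random 256x256 RGB images in [0, 1]
pred = np.random.rand(256, 256, 3)
gt = np.random.rand(256, 256, 3)

print(SSIM(pred, gt))  # somewhere below 1.0 for different images
print(SSIM(gt, gt))    # identical inputs give ~1.0
```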
Oh right, I had missed the explicit upsampling part. Thanks!
I had one doubt, @silverbottlep, it'd be great if you could clarify. In the following lines from train_tvsn.lua (line 228), why is this being done? I originally thought we wanted to store and compare features of the first three layers of the discriminator for real/fake images, but then I saw the fill(0). Is that because you later decided not to do that comparison (did it not help training?) and so filled those gradients with zeros?
```lua
for l=1,opt.loss_layer do
  table.insert(df_do, out[l]:clone():fill(0))
end
```
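(To check my understanding, I would express those lines in PyTorch roughly as below, i.e. the zeros mean no gradient actually flows back through the first loss_layer outputs; netD and fake are just placeholders for the discriminator and the generated batch. Please correct me if I'm misreading it.)

```python
import torch

# Hypothetical PyTorch analogue of the snippet above: if netD returns a list
# [feat_1, ..., feat_k, score], backprop zeros through the feature outputs so
# only the final score contributes to the gradient.
outputs = netD(fake)                               # list of tensors (placeholder names)
grad_outputs = [torch.zeros_like(o) for o in outputs[:-1]]
grad_outputs.append(torch.ones_like(outputs[-1]))  # plain gradient for the score term
torch.autograd.backward(outputs, grad_outputs)
```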
Thanks!
Hi @silverbottlep , thanks for the great code! I'm currently trying to apply this code to my own setting (different \theta, \phi, etc.), so I have to generate the dataset myself. I wonder how you generated this file; it seems it should be a file containing the dataset split.
Thanks in advance!