thomaskuestner / CNNArt

Automatic and reference-free MR artifact detection
Apache License 2.0

perceptual loss based VAE #55

Closed TokenJan closed 6 years ago

TokenJan commented 6 years ago

Architecture

This week I've implemented the proposed perceptual loss and appended it to the VAE. Here is the architecture of the whole network. The perceptual loss network is based on the VGG 19 network: the original reference image, as well as the decoded artifact and reference images, are fed into VGG 19. The well-trained VGG 19 network acts as a feature extractor, an MSE loss is calculated between the reference and decoded images after certain layers (in our case layers 1, 4 and 7), and these per-layer losses are summed up as the perceptual loss. According to this and this paper, such a perceptual loss can mitigate the blurriness problem to some extent compared with a pixel-by-pixel MSE loss.
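For concreteness, here is a minimal sketch of such a loss in tf.keras. The layer indices 1, 4 and 7 come from the description above; the exact wiring into the VAE and all other choices are my assumptions, not the actual CNNArt code:

```python
# Minimal sketch of the described perceptual loss (tf.keras); illustrative only.
import tensorflow as tf
from tensorflow.keras.applications import VGG19
from tensorflow.keras.models import Model

vgg = VGG19(weights='imagenet', include_top=False, input_shape=(48, 48, 3))
vgg.trainable = False  # frozen, acts purely as a feature extractor

# Model returning the activations after layers 1, 4 and 7
# (block1_conv1, block2_conv1, block3_conv1).
feature_extractor = Model(
    inputs=vgg.input,
    outputs=[vgg.layers[i].output for i in (1, 4, 7)])

def perceptual_loss(y_ref, y_dec):
    """Sum of per-layer MSE losses between VGG 19 features of both images."""
    losses = [tf.reduce_mean(tf.square(fr - fd))
              for fr, fd in zip(feature_extractor(y_ref),
                                feature_extractor(y_dec))]
    return tf.add_n(losses)
```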

[figure: perceptual loss network architecture]

Results

Below are the results trained with patch size [48, 48] and the perceptual loss. I think the details are preserved better than before. The network doesn't predict very well on the noise images (like the last pair); I'm not quite sure what these noise patches refer to and will investigate later.

[figures: perceptual result 1, perceptual result 2]

Improvements

Implementing the perceptual loss network with VGG 19 is a bit tricky, because it only accepts inputs with 3 channels, of size at least 48 x 48, and preprocessed in 'caffe' format. So I stack each of our patches three times to simulate a BGR image and apply some preprocessing before feeding it into the VGG 19 network; see the sketch below. I'm therefore wondering whether it would be a good idea to replace VGG 19 with our own trained network as the feature extractor. Besides, I will also investigate the multi-scale and 3D conv parts.
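A minimal sketch of this channel stacking and preprocessing; the helper name and the assumption that patches arrive as (N, 48, 48, 1) arrays in [0, 255] are mine:

```python
# Hypothetical helper: prepare single-channel patches for VGG 19 input.
import numpy as np
from tensorflow.keras.applications.vgg19 import preprocess_input

def to_vgg19_input(patches):
    """patches: float array of shape (N, 48, 48, 1), assumed in [0, 255]."""
    rgb = np.repeat(patches, 3, axis=-1)  # replicate the gray channel 3x
    # 'caffe'-style preprocessing: RGB -> BGR plus ImageNet-mean centering.
    return preprocess_input(rgb.astype('float32'))
```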

thomaskuestner commented 6 years ago

Thanks for the update. You are right, the results look better and the high-frequency information is preserved. However, blocking/cartoon-like artifacts now appear, which is also not favorable. I think a multi-scale network might help in this scenario, i.e. correcting the motion on differently sized input patches (either in one network simultaneously or as an iterative approach); a rough sketch of the simultaneous variant follows below. Also, 3D estimation, as you suggested, might help to circumvent this a little.
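A rough sketch of the simultaneous variant: two patch scales passed through a shared convolutional layer and fused before the output. All layer choices here are illustrative, not CNNArt code:

```python
# Illustrative multi-scale idea: shared weights over two patch sizes.
from tensorflow.keras import layers, Model

inp_48 = layers.Input(shape=(48, 48, 1))
inp_96 = layers.Input(shape=(96, 96, 1))

shared_conv = layers.Conv2D(32, 3, padding='same', activation='relu')

feat_48 = shared_conv(inp_48)
# Bring the larger scale to the same spatial size before fusing.
feat_96 = layers.MaxPooling2D(2)(shared_conv(inp_96))

fused = layers.Concatenate()([feat_48, feat_96])
out = layers.Conv2D(1, 3, padding='same')(fused)

multi_scale_net = Model([inp_48, inp_96], out)
```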

Did you encounter any problems using the Caffe framework on our servers? I have never had good experience with it. What is the current reconstruction time? I think we should try to replace VGG 19 with our own networks. As a starter, you could also create and train your own network that is similar or identical to VGG 19 and then start from there (you don't have to stick solely to already-trained networks).

TokenJan commented 6 years ago

@thomaskuestner The blocking/cartoon-like artifacts are indeed the main concern in this case; I'll try to figure out their cause and how to avoid them. As for the 'caffe' format, I didn't mean the Caffe framework but a particular image preprocessing expected at the input of VGG 19:

> caffe: will convert the images from RGB to BGR, then will zero-center each color channel with respect to the ImageNet dataset, without scaling.
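Written out explicitly, that mode amounts to the following (the channel means are the standard ImageNet values used by Keras; the function name is mine):

```python
# NumPy sketch of the quoted 'caffe' preprocessing mode.
import numpy as np

IMAGENET_MEANS_BGR = np.array([103.939, 116.779, 123.68])

def caffe_preprocess(rgb_batch):
    """RGB -> BGR, then zero-center each channel with ImageNet means, no scaling."""
    bgr = rgb_batch[..., ::-1].astype('float64')  # flip channel order
    return bgr - IMAGENET_MEANS_BGR               # broadcasts over last axis
```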

I tested the reconstruction time under the following environment: it takes 3 seconds to reconstruct 4800 patches of size 48 x 48.