Open devalexqt opened 3 years ago
Hi devalexqt, we worked on this project before stable tf2 versions were released. Unfortunately, because tf2 does not rely on the session-based coding structure, migrating to tf2 may require refactoring most of the code.
Thanks!
Hi! Do you have time to look at my tf2 implementation of vmnet? I have a couple of questions.
@devalexqt Sure, it would be great if I could help.
Give me some time to prepare the code.
@devalexqt No problem! Please just ping me when your code is ready. Also, if you want to keep your code private (instead of public), please just add me to your repository as a collaborator, so I can check your code and discuss further in your repo. :)
Can you please explain why we need to use the RGB mean (127.5, 127.5, 127.5) instead of the full range 255?
Can you please explain why we need to use the RGB mean (127.5, 127.5, 127.5) instead of the full range 255?
If you want to follow the same training procedure as in our paper, each pixel value needs to be shifted to the range -127.5 ~ 127.5 by subtracting 127.5 from each pixel value (originally in the range 0 ~ 255).
https://github.com/idearibosome/tf-vmnet/blob/main/models/vmnet.py#L281
Also, don't forget to add 127.5 to each pixel value of the output, so that the final output image has the range of 0 ~ 255 for each pixel.
This so-called "mean shifting" process is common in various super-resolution models. But if you are writing your own code to train your model, that procedure is not actually mandatory (the performance difference can be marginal).
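For readers following along, here is a minimal sketch of the mean shifting described above, assuming images are NumPy arrays with values in 0 ~ 255. The function names are made up for illustration, and the rounding and clipping on the output side are an assumption of this sketch, not something taken from the tf-vmnet code.

```python
import numpy as np

MEAN_SHIFT = 127.5  # value quoted in the thread


def shift_input(image_uint8):
    """Map pixel values from 0..255 to -127.5..127.5 before feeding the model."""
    return image_uint8.astype(np.float32) - MEAN_SHIFT


def unshift_output(model_output):
    """Add the mean back; round and clip (assumption) to get a valid 0..255 image."""
    restored = model_output + MEAN_SHIFT
    return np.clip(np.round(restored), 0, 255).astype(np.uint8)


image = np.array([[0, 128, 255]], dtype=np.uint8)
shifted = shift_input(image)
restored = unshift_output(shifted)  # identity "model", for illustration only
print(shifted, restored)
```

The same two steps apply regardless of the framework: subtract before the first layer, add back after the last one.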
So basically, if we use mean shifting, we operate with a -1...1 range instead of 0...1 (for 0...255), but why? Looks like I get similar results with my tf2 implementation. I will share the code soon.
So basically, if we use mean shifting, we operate with a -1...1 range instead of 0...1 (for 0...255), but why?
This is a common approach in many deep learning models. Some researchers found that rescaling the image pixel values to something like -1...1 gave slightly better performance than just using the 0...1 range, but I currently don't remember the papers that explain this, and the effect may not be large.
More precisely, some super-resolution models (like EDSR) subtract the mean pixel values of the entire DIV2K dataset from the R, G, B channels.
https://github.com/limbee/NTIRE2017/issues/19
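A rough sketch of the per-channel variant mentioned above, assuming the training images are available as one NumPy array. The random array is only a stand-in for a real dataset such as DIV2K, and the names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a training set: 4 random RGB images of 8x8 pixels in 0..255.
dataset = rng.integers(0, 256, size=(4, 8, 8, 3)).astype(np.float32)

# One mean per channel, averaged over all images and pixel positions.
channel_mean = dataset.mean(axis=(0, 1, 2))  # shape (3,)


def subtract_channel_mean(images):
    """Subtract the dataset-wide R, G, B means (broadcasts over N, H, W)."""
    return images - channel_mean


centered = subtract_channel_mean(dataset)
print(channel_mean)
```

As with the fixed 127.5 shift, the same means would have to be added back to the network output.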
Looks like I get similar results with my tf2 implementation. I will share the code soon.
Great news!
I just sent you an invite.
@devalexqt Got it. I accepted the invitation and looked into your code briefly (especially train.py and vmnet.py). The overall structure of the VMNet looks good.
I found that there are two possible differences.
First is about reusing weights of the vmb_block. The VMB should be defined only once and it should be used multiple times.
In tf1, this can be handled by tf.variable_scope with the reuse argument like this:
https://github.com/idearibosome/tf-vmnet/blob/main/models/vmnet.py#L308
In your Keras-based implementation, this can be done by creating layers.Conv2D instances first (e.g., in your vmnet() function), and then reusing them by calling the already created layers.Conv2D instances.
Reference: https://github.com/keras-team/keras/issues/10747
For example, change this code
def vmb_block(x, state):
    x = layers.Concatenate()([x, state])
    x = layers.Conv2D(filters*2, kernel_size, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters*2, kernel_size, padding="same", activation=None)(x)
    ...
def vmnet(...):
    ...
    for i in range(num_vmb_blocks):
        x, state = vmb_block(x, state)
    ...
to something like this:
def vmb_block(x, state, conv_layers):
    x = layers.Concatenate()([x, state])
    x = conv_layers[0](x)
    x = conv_layers[1](x)
    ...
def vmnet(...):
    ...
    conv_layers = []
    conv_layers.append(layers.Conv2D(filters*2, kernel_size, padding="same", activation="relu"))
    conv_layers.append(layers.Conv2D(filters*2, kernel_size, padding="same", activation=None))
    ...
    for i in range(num_vmb_blocks):
        x, state = vmb_block(x, state, conv_layers)
    ...
I didn't verify the above code, so please check that it "truly" creates the VMB part only once, for example by calling model.summary() or something similar.
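One way to run the check suggested above is to compare parameter counts, sketched here under the assumption of TensorFlow 2 with tf.keras. The build_model helper and its layer sizes are made up for illustration and are not taken from either repository.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_model(num_calls, share_weights):
    """Apply a 3x3 conv num_calls times, either reusing one layer or creating new ones."""
    inputs = tf.keras.Input(shape=(16, 16, 8))
    shared_conv = layers.Conv2D(8, 3, padding="same", activation="relu")
    x = inputs
    for _ in range(num_calls):
        if share_weights:
            conv = shared_conv  # same instance, so the same weights every call
        else:
            conv = layers.Conv2D(8, 3, padding="same", activation="relu")
        x = conv(x)
    return tf.keras.Model(inputs, x)


shared = build_model(num_calls=4, share_weights=True)
unshared = build_model(num_calls=4, share_weights=False)

# With sharing, the count stays at one layer's worth of parameters
# no matter how many times the layer is called.
print(shared.count_params(), unshared.count_params())
```

If the count grows with the number of VMB calls, the layers are being recreated on every call instead of reused.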
Second is about obtaining output images during training. In our implementation, our model obtains an output image from every VMB call, resulting in a total of 16 output images.
https://github.com/idearibosome/tf-vmnet/blob/main/models/vmnet.py#L364
Then, these images are combined like this:
X = ((1 * X_1) + (2 * X_2) + (4 * X_3) + (8 * X_4) + ... + (32768 * X_16)) / (1 + 2 + 4 + 8 + ... + 32768)
where X_1 is the output image obtained from the first VMB call, X_2 is the output image obtained from the second VMB call, etc.
This process ensures that the optimizer considers the intermediate VMB outputs of the VMNet during training.
Note that after the training (during testing), the output is just X = X_16.
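The weighted combination above can be sketched in NumPy as follows. This is only an illustration of the formula with random stand-in outputs, and combine_outputs is a hypothetical name, not a function from tf-vmnet.

```python
import numpy as np


def combine_outputs(outputs):
    """Weighted average of per-call outputs with weights 1, 2, 4, ... (doubling each call)."""
    weights = 2.0 ** np.arange(len(outputs))  # 1, 2, 4, ..., 2^(n-1)
    stacked = np.stack(outputs, axis=0)       # shape (n, H, W, C)
    weighted_sum = np.tensordot(weights, stacked, axes=1)  # contract over n
    return weighted_sum / weights.sum()


rng = np.random.default_rng(0)
outputs = [rng.random((4, 4, 3)) for _ in range(16)]  # stand-ins for X_1 .. X_16

train_output = combine_outputs(outputs)  # used for the loss during training
test_output = outputs[-1]                # X = X_16 at test time
print(train_output.shape)
```

With 16 calls the weights sum to 65535, so the last output alone carries about half of the total weight while earlier outputs still contribute to the loss.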
Feel free to ask any other questions :)
Thanks for replying. I will change the code and let you know.
I changed the code and am now testing and validating... Can you tell me how much time the original training process takes? I am trying to remove jpeg artifacts from images. They are removed, but the image is blurry, so is it possible to decrease the blur effect?
Can you tell me how much time the original training process takes?
It depends on the configuration and GPU type. In our experiment, with a patch size of 32 px (= 128 px in high resolution), it took about 3~4 days to run 1,000,000 iterations with a batch size of 8 (= about 10000 epochs) on an NVIDIA GTX 1080 GPU.
I am trying to remove jpeg artifacts from images. They are removed, but the image is blurry, so is it possible to decrease the blur effect?
Can you measure PSNR and SSIM values of the output images? Comparing PSNR or SSIM values can be a way to find out whether your model is sufficiently trained or not.
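A minimal PSNR sketch for 8-bit images, assuming both images are NumPy arrays of the same shape. Super-resolution papers often compute PSNR on the luminance channel with some border cropping, which this sketch does not do. TensorFlow also provides tf.image.psnr and tf.image.ssim for the same purpose.

```python
import numpy as np


def psnr(reference, output, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    ref = reference.astype(np.float64)
    out = output.astype(np.float64)
    mse = np.mean((ref - out) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)


# Toy example: every pixel differs by exactly 1, so the MSE is 1.
reference = np.full((8, 8, 3), 100, dtype=np.uint8)
output = np.full((8, 8, 3), 101, dtype=np.uint8)
value = psnr(reference, output)
print(round(value, 2))
```

Tracking this value on a fixed validation set over training gives a less subjective signal than judging blur by eye.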
The meaning of "blurry" can be subjective, but there are several tips in the literature for improving performance:
Reference: https://arxiv.org/abs/1809.00219
It is known that a larger patch size can improve super-resolution performance. But it may also increase the training time, so there is a trade-off between time complexity and output performance.
Our VMNet model is basically focused on minimizing the model size while preserving PSNR-oriented performance. That's why we modeled our structure to use only one VMB, instead of stacking multiple VMBs.
There are some other approaches to improve perceptual quality by employing so-called Generative Adversarial Networks (GANs). ESRGAN is a good example.
https://github.com/xinntao/ESRGAN
We also developed another super-resolution model that considers both perceptual and PSNR-oriented quality measures.
https://github.com/idearibosome/tf-perceptual-eusr
But please note that the above models are considerably larger than VMNet.
Thanks, I will try.
These are the before and after images (with jpeg artifacts).
I found an interesting paper https://arxiv.org/pdf/1511.08861.pdf about using PSNR + SSIM as a loss function. I need to dive deep into it and try to implement it.
These are the before and after images (with jpeg artifacts).
How many iterations (or epochs) did you run to get this result? Also, I think training on images from your own dataset instead of DIV2K may give better results for your target images.
Actually, I think this is a very interesting topic: increasing resolution along with removing JPEG artifacts. Very recently (several months ago), there was a challenge for solving image deblurring + super-resolution simultaneously: https://arxiv.org/abs/2104.14854
I hope this also can help you to find another direction :)
For quick testing on images with heavy jpeg artifacts, I use a 10 min run (10 epochs with 1000 steps per epoch) with "mean shifting". Previously, without "mean shifting", such a short run gave results that were far off, and it required 3+ days of training to get a similar result on a Quadro RTX 4000 GPU. Also, I am thinking about how to embed "deblurring" inside the network to remove the blurry result, so thanks for attaching the paper!
I found that "mean shifting" is a game changer for such a task!
For quick testing on images with heavy jpeg artifacts, I use a 10 min run (10 epochs with 1000 steps per epoch) with "mean shifting". Previously, without "mean shifting", such a short run gave results that were far off, and it required 3+ days of training to get a similar result on a Quadro RTX 4000 GPU. I found that "mean shifting" is a game changer for such a task!
Oh really? That's very interesting!
Yea! Without "mean shifting" and with a 10 min run, I see the following effect: in the light parts of the test image the artifacts start to be removed, but in the dark parts the artifacts are not removed. I think this is because the light parts are close to 1 while the dark parts are close to 0 in the image. Please look at the top left corner with the sky (light part) and the hand (dark part).
Can you share the code for the paper?
Yea! Without "mean shifting" and with a 10 min run, I see the following effect: in the light parts of the test image the artifacts start to be removed, but in the dark parts the artifacts are not removed. I think this is because the light parts are close to 1 while the dark parts are close to 0 in the image. Please look at the top left corner with the sky (light part) and the hand (dark part).
Ah, I see. This may be related to the ReLU function inside the layers. I never tested our model without mean shifting, so I did not expect that much difference. :)
Can you share the code for the paper?
https://arxiv.org/abs/2104.14854 this one? This is not our work; it is a summary of that deblurring + super-resolution challenge. https://competitions.codalab.org/competitions/28073
Maybe this is the code for the 1st ranked solution? - https://github.com/zeyuxiao1997/EDPN
Unfortunately, almost all code for super-resolution is written in PyTorch, not TensorFlow...
Initially I tested your original model and got the same result as my code with mean shifting.
Thanks, I will read the papers.
I tried running with a modified loss function: mean absolute error + ssim
def call(self, y_true, y_pred):
    l1 = tf.reduce_mean(tf.losses.mean_absolute_error(y_pred, y_true))
    ssim = tf.clip_by_value(ssim_metric(y_true, y_pred), 0, 1)
    loss = 1 - tf.reduce_mean(tf.clip_by_value(tf.image.ssim(y_true, y_pred, 2.0), 0, 1)) + l1
    return loss
With a 10 min training run I got a more blurry result, but when I ran it longer, up to 12h (800 epochs * 1000 steps), the result got less and less blurry. If I run it even longer, it may outperform the plain "mean absolute error" loss.
Also, can you help me find a paper about "mean shifting"?
Thank you for sharing your results. Because SSIM considers perceptual quality slightly more than PSNR, the visual quality may be better.
Also, can you help me find a paper about "mean shifting"?
I still cannot find a paper discussing mean shifting, but these links address that part.
Note that mean shifting is basically a very common approach to adjust input data in various deep learning models (beyond super-resolution models).
Also, DIV2K contains natural images, so I think training with your own dataset containing images more similar to your testing images (maybe shooting game screenshots?) may be beneficial, because those images may contain more relevant textures than natural textures (captured by digital cameras).
Hi! First of all, very good job and good results with less blur. But why not use TensorFlow 2?