open-mmlab / mmagic

OpenMMLab Multimodal Advanced, Generative, and Intelligent Creation Toolbox. Unlock the magic 🪄: Generative-AI (AIGC), easy-to-use APIs, awesome model zoo, diffusion models, text-to-image generation, image/video restoration/enhancement, etc.
https://mmagic.readthedocs.io/en/latest/
Apache License 2.0

Doesn't generalize to other data #59

Closed jorenvs closed 4 years ago

jorenvs commented 4 years ago

I tried applying the video super-resolution model (EDVR) to other data, but I'm getting very weak results. The output barely seems to differ from the input in quality. Examples below (left is the output, right is the zoomed-in input).

I tried both the EDVR_REDS_SR_L and the EDVR_Vimeo90K_SR_L models with varying input sizes, getting similar results. Is this to be expected? I would guess that, since the REDS4 dataset was also mostly street scenes, it should at least perform similarly here.

[Three screenshots: output (left) vs. zoomed-in input (right) comparisons]

The code I'm using is below (adapted from test_Vid4_REDS4_with_GT.py and moved to the root folder of the repo). I tested it on the REDS4 dataset with no issues.

'''
Test Vid4 (SR) and REDS4 (SR-clean, SR-blur, deblur-clean, deblur-compression) datasets
'''

import sys
sys.path.insert(0, 'codes')

import os
import os.path as osp
import glob
import logging
import numpy as np
import cv2
import torch

import utils.util as util
import data.util as data_util
import models.archs.EDVR_arch as EDVR_arch

#################
# configurations
#################
device = torch.device('cuda')
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
data_mode = 'sharp_bicubic'  # Vid4 | sharp_bicubic | blur_bicubic | blur | blur_comp
# Vid4: SR
# REDS4: sharp_bicubic (SR-clean), blur_bicubic (SR-blur);
#        blur (deblur-clean), blur_comp (deblur-compression).
stage = 1  # 1 or 2, use two stage strategy for REDS dataset.
flip_test = False
############################################################################
#### model
model_path = 'experiments/pretrained_models/EDVR_REDS_SR_L.pth'

N_in = 5  # use N_in images to restore one HR image

predeblur, HR_in = False, False
back_RBs = 40
model = EDVR_arch.EDVR(128, N_in, 8, 5, back_RBs, predeblur=predeblur, HR_in=HR_in)

test_dataset_folder = 'datasets/streetscenes'

#### evaluation
crop_border = 0
border_frame = N_in // 2  # border frames when evaluate
# temporal padding mode
if data_mode == 'Vid4' or data_mode == 'sharp_bicubic':
    padding = 'new_info'
else:
    padding = 'replicate'
save_imgs = True

save_folder = 'results/streetscenes'
util.mkdirs(save_folder)
util.setup_logger('base', save_folder, 'test', level=logging.INFO, screen=True, tofile=True)
logger = logging.getLogger('base')

#### log info
logger.info('Data: {} - {}'.format(data_mode, test_dataset_folder))
logger.info('Padding mode: {}'.format(padding))
logger.info('Model path: {}'.format(model_path))
logger.info('Save images: {}'.format(save_imgs))
logger.info('Flip test: {}'.format(flip_test))

#### set up the models
model.load_state_dict(torch.load(model_path), strict=True)
model.eval()
model = model.to(device)

img_path_l = sorted(glob.glob(osp.join(test_dataset_folder, '*')))
max_idx = len(img_path_l)
if save_imgs:
    util.mkdirs(save_folder)

#### read LQ images (the whole sequence is loaded once)
imgs_LQ = data_util.read_img_seq(test_dataset_folder)

# process each image
for img_idx, img_path in enumerate(img_path_l):
    print(img_idx, img_path)
    img_name = osp.splitext(osp.basename(img_path))[0]
    select_idx = data_util.index_generation(img_idx, max_idx, N_in, padding=padding)
    print('select_idx:', select_idx)
    imgs_in = imgs_LQ.index_select(0, torch.LongTensor(select_idx)).unsqueeze(0).to(device)

    output = util.single_forward(model, imgs_in)
    output = util.tensor2img(output.squeeze(0))

    if save_imgs:
        cv2.imwrite(osp.join(save_folder, '{}.png'.format(img_name)), output)
nelaturuharsha commented 4 years ago

REDS-trained EDVR needs to be run through two stages, so you could try passing the output from stage 1 through the stage-2 model. Other than that, I believe there is a flip_test mode, which helps improve the quality.
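(For reference, a minimal sketch of what a stage-2 pass could look like, following the layout of the original test_Vid4_REDS4_with_GT.py. The checkpoint name EDVR_REDS_SR_Stage2.pth, the HR_in/back_RBs settings, and util.flipx4_forward are taken from the EDVR repo and may differ in your checkout; the folder names simply reuse the ones from the script above.)

# stage 2: refine the stage-1 outputs with the HR-input variant of EDVR
stage2_in_folder = 'results/streetscenes'        # stage-1 outputs
stage2_out_folder = 'results/streetscenes_stage2'
stage2_model_path = 'experiments/pretrained_models/EDVR_REDS_SR_Stage2.pth'

predeblur, HR_in = False, True   # stage 2 takes HR-sized input
back_RBs = 20
model = EDVR_arch.EDVR(128, N_in, 8, 5, back_RBs, predeblur=predeblur, HR_in=HR_in)
model.load_state_dict(torch.load(stage2_model_path), strict=True)
model.eval()
model = model.to(device)

util.mkdirs(stage2_out_folder)
imgs_stage1 = data_util.read_img_seq(stage2_in_folder)
stage2_path_l = sorted(glob.glob(osp.join(stage2_in_folder, '*')))
for img_idx, img_path in enumerate(stage2_path_l):
    img_name = osp.splitext(osp.basename(img_path))[0]
    select_idx = data_util.index_generation(img_idx, len(stage2_path_l), N_in, padding=padding)
    imgs_in = imgs_stage1.index_select(0, torch.LongTensor(select_idx)).unsqueeze(0).to(device)
    # flip_test averages predictions over flipped copies (x4 self-ensemble)
    if flip_test:
        output = util.flipx4_forward(model, imgs_in)
    else:
        output = util.single_forward(model, imgs_in)
    output = util.tensor2img(output.squeeze(0))
    cv2.imwrite(osp.join(stage2_out_folder, '{}.png'.format(img_name)), output)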

jorenvs commented 4 years ago

Yeah, I figured out afterwards that it has two stages; I'll rerun the experiment soon. The flip_test mode looks interesting, and I guess an ensemble should improve generalisation a little.

adamsvystun commented 4 years ago

I can report a similar issue. I have tried using both stages and flip_test, and tested on a variety of videos. The model does not perform to the level it does on REDS4: the output has multiple artifacts and is blurry overall.

adamsvystun commented 4 years ago

Okay, I solved my issue. The problem was the downsampling method. The datasets that the model was trained on were created by downsampling with MATLAB's imresize function. So if you generate input data with anything else (OpenCV, FFmpeg), it doesn't work. You have to use MATLAB's imresize, or its Python equivalent, which is implemented in this repo.
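(For illustration, a minimal sketch of generating LR frames with the repo's MATLAB-style bicubic resize instead of cv2.resize or an FFmpeg scale filter. It assumes the imresize_np helper in codes/data/util.py, which expects a float image in [0, 1]; the file names are placeholders.)

import cv2
import data.util as data_util  # from the EDVR codes/ folder

img = cv2.imread('frame_0001.png').astype('float32') / 255.0   # HWC, BGR, values in [0, 1]
# MATLAB-imresize-style bicubic downscale by 4x with antialiasing,
# unlike the default behaviour of cv2.resize / ffmpeg scale
img_lr = data_util.imresize_np(img, 1 / 4, True)
cv2.imwrite('frame_0001_LRx4.png',
            (img_lr * 255.0).round().clip(0, 255).astype('uint8'))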

jorenvs commented 4 years ago

Hmm, that's what I feared. That kind of defeats the purpose of super resolution. I don't want to downsample my data, I want to upsample it :).

nelaturuharsha commented 4 years ago

@adamsvystun Could you describe the exact flow you used with the function you mentioned? Did you basically resize your input video from H x W to the target resolution with that method and then pass it through EDVR to get the output? There were some weird blue-green artifacts during fast motion in my output, so I'm curious.

adamsvystun commented 4 years ago

@jorenvs It should work for upsampling. In my case I had a 720p video and wanted to test 180p->720p upsampling, which is why I had to downsample first; it turns out the model is very sensitive to how that downsampling is done. If you only have a low-res video, it should just work.

@SreeHarshaNelaturu Yeah, for testing, I first downsample, then upsample with the model, and compare the results. I'm not sure about the blue-green artifacts; I did not have any.
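(A minimal sketch of that comparison loop under the same assumptions: the restored frames from the script above are compared against the original HR frames using util.calculate_psnr from the EDVR utils; the folder names are placeholders.)

import glob
# compare restored frames against the original HR frames
psnr_sum, n = 0.0, 0
for gt_path in sorted(glob.glob('datasets/streetscenes_GT/*.png')):
    name = osp.basename(gt_path)
    gt = cv2.imread(gt_path).astype('float32')
    sr = cv2.imread(osp.join('results/streetscenes', name)).astype('float32')
    psnr_sum += util.calculate_psnr(sr, gt)   # PSNR on [0, 255] images
    n += 1
print('average PSNR: {:.2f} dB over {} frames'.format(psnr_sum / n, n))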

jorenvs commented 4 years ago

Well, my videos are 1344x1344, so not really low quality. That's all relative to the angle of the lens, of course; these are generated from 360° 5.6K GoPro videos. The goal is to be able to read far-away text on traffic signs and such.

nelaturuharsha commented 4 years ago

Thank you for the prompt response @adamsvystun. I was wondering about the part you mentioned about not using FFmpeg or cv2 to generate input data. What did you use to extract frames from the video you wanted to SR, if not those methods?

adamsvystun commented 4 years ago

@SreeHarshaNelaturu I said don't use FFmpeg or cv2 for downscaling (resizing down). For frame extraction you can use anything you want.

nelaturuharsha commented 4 years ago

Gotcha, I think the blue-green artifacts are a consequence of something else. And yep, I was resizing via FFmpeg; it might help to resize after extraction instead.

Thank you!

ryul99 commented 4 years ago

I'm not sure, but I think MATLAB's downscaling method can differ from FFmpeg's and cv2's. In my case, EDVR works well with bicubic-downscaled input, but it produces artifacts with other inputs (e.g. low-res videos from YouTube). I guess EDVR trained on the REDS dataset is overfitted to reconstructing bicubic downscaling, since the REDS dataset consists of bicubic-downscaled images.

xinntao commented 4 years ago

Yes, the current CNN-based methods do not generalize to other datasets with different downsampling kernels.

There is another research field, called blind SR, that aims to solve this issue.