hzwer / ECCV2022-RIFE

ECCV2022 - Real-Time Intermediate Flow Estimation for Video Frame Interpolation
MIT License

training questions #293

Closed oldie77 closed 1 year ago

oldie77 commented 1 year ago

Hey there, thanks much for sharing your work, it's much appreciated!

I'm trying to train 4.6 from scratch using Vimeo triplets, starting with a fixed 0.5 time step only, and I'm having some problems getting good results. So far I'm only at iteration 350k, so not far in yet, but for the first 250-300k iterations results were very bad, with PSNR between 12 and 18. At about 280k it finally started looking a bit better, though still not great, with PSNR at 23. Having looked at your training graphs, it seems you get much better results much earlier than I do?

FWIW, I've combined the 4.6 model py files with the train.py + dataset.py files from this repository. In the 4.6 model files, I no longer see the teacher distillation approach. May I ask if you've removed it, or are you still using it? I'm training without it at the moment, and I'm wondering if that's why my training appears to work much worse than yours. I'm also training without context or refinement, if that makes a difference.

Also, may I ask which loss you're using now? 4.6 seems to use only L1 and smooth loss, and no VGG perceptual loss anymore. Is that correct? In my own experiments with frame interpolation, VGG perceptual loss helped, but I didn't use smooth loss, so maybe perceptual loss isn't needed when smooth loss is used; I don't really know.

Thank you!! :)

hzwer commented 1 year ago

I do not manage the training code of v4.6, so please ignore the relevant content. My current loss function is 0.1 L1 loss + 1.0 VGG loss.

I do not use the distillation loss now; instead I calculate and sum up the L1 loss of each block of the IFNet.

The VGG loss is calculated on the final result
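
As a rough, unofficial sketch (assuming `merged` is the list of blended outputs from the IFNet blocks, `gt` the ground truth, and `vgg` a perceptual-loss module), that combination would look roughly like this:

    # 0.1 * (L1 summed over every block) + 1.0 * (VGG loss on the final result only)
    loss_l1 = sum((m - gt).abs().mean() for m in merged)
    loss_vgg = vgg(merged[-1], gt)
    loss_G = 0.1 * loss_l1 + 1.0 * loss_vgg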

My validation curve is like this:

(validation curve image)

My training batch size is 64, with a learning rate of 1e-4.

oldie77 commented 1 year ago

Thank you, I appreciate your reply! So you're not using smooth loss anymore? And you're using VGG during training but LPIPS during validation? Your validation curves look dramatically better than mine: I'm at 29 PSNR now after 800k iterations (but using only a batch size of 16). I'll try to train with your parameters now.

hzwer commented 1 year ago
  1. Yes, smooth loss is not necessary; it just fixes some cases
  2. Yes, VGG for training and LPIPS for validation
  3. I think the learning rate may be important
oldie77 commented 1 year ago

Thanks a ton. After changing all that, training looks dramatically better now. I think the biggest improvement was calculating the L1 loss on all blocks, whereas the official 4.6 py files only calculate it on the final result.

Here are a few suggested changes to the 4.6 py files, implementing some of the changes you described:

old code:

            mask_list.append(mask)
            flow_list.append(flow)
            warped_img0 = warp(img0, flow[:, :2])
            warped_img1 = warp(img1, flow[:, 2:4])
            merged.append((warped_img0, warped_img1))
        mask_list[3] = torch.sigmoid(mask_list[3])
        merged[3] = merged[3][0] * mask_list[3] + merged[3][1] * (1 - mask_list[3])

new code:

            # apply the sigmoid and blend the warped frames at every block,
            # so each block's merged output can be supervised directly
            warped_img0 = warp(img0, flow[:, :2])
            warped_img1 = warp(img1, flow[:, 2:4])
            mask = torch.sigmoid(mask)
            mask_list.append(mask)
            flow_list.append(flow)
            merged.append(warped_img0 * mask + warped_img1 * (1 - mask))

old code:

        flow, mask, merged = self.flownet(torch.cat((imgs, gt), 1), scale=scale, training=training)
        loss_l1 = (merged[3] - gt).abs().mean()
        loss_smooth = self.sobel(flow[3], flow[3]*0).mean()
        # loss_vgg = self.vgg(merged[2], gt)
        if training:
            self.optimG.zero_grad()
            loss_G = loss_l1 + loss_cons + loss_smooth * 0.1

new code:

        flow, mask, merged = self.flownet(torch.cat((imgs, gt), 1), scale_list=scale, training=training)
        # L1 loss summed over all four block outputs; VGG loss only on the final result
        loss_l1 = (merged[0] - gt).abs().mean() + (merged[1] - gt).abs().mean() + (merged[2] - gt).abs().mean() + (merged[3] - gt).abs().mean()
        loss_smooth = self.sobel(flow[3], flow[3]*0).mean()
        loss_vgg = self.vgg(merged[3], gt)
        if training:
            self.optimG.zero_grad()
            #loss_G = loss_l1 * 0.1 + loss_smooth * 0.1 + loss_vgg * 1.0
            loss_G = loss_l1 * 0.1 + loss_vgg * 1.0

Two more questions, if you don't mind:

1. Are you using a fixed 1e-4 learning rate, or are you still using cosine?
2. The 4.6 py file says "self.version = 3.9". I assume this just hasn't been updated yet, correct? Or does that mean the 4.6 download contains old py files?

And one suggestion:

Currently, the way the py files are written, the L1 loss on the coarser levels is calculated at the original image resolution: the coarser levels are upsampled with bilinear scaling and the loss is computed against the full-resolution ground truth. I wonder whether calculating the loss at the smaller resolution of the coarser levels could improve accuracy a tiny bit (a rough sketch follows below). I'm not sure, though; it's just a thought, and it might actually end up being worse. But I thought I'd mention the idea in case you find it worth trying for 4.7.
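
A very rough sketch of that idea, assuming the block outputs were available before bilinear upsampling, e.g. as a hypothetical `merged_small` list (which the current py files do not expose):

    import torch.nn.functional as F

    # compute the per-block L1 loss at each block's own resolution by downscaling
    # the ground truth instead of upsampling the coarse block output
    loss_l1 = 0.0
    for m in merged_small:  # coarse-to-fine block outputs (hypothetical)
        gt_small = F.interpolate(gt, size=m.shape[-2:], mode='bilinear', align_corners=False)
        loss_l1 = loss_l1 + (m - gt_small).abs().mean()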

hzwer commented 1 year ago

Yes, using at least one "deep supervision" method or "distillation" will make training a lot easier.

  1. I'm still using cosine annealing (a rough sketch follows below).
  2. Please ignore this, as there is no temporal encoding before 3.9. Thanks for your suggestion; I'm currently researching how to improve LPIPS and will do some experimenting.
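
For reference, here is a minimal sketch of cosine annealing with a short linear warmup; the warmup length and the peak/minimum learning rates are assumptions, not the repo's exact constants:

    import math

    def get_learning_rate(step, total_steps, peak_lr=1e-4, min_lr=1e-6, warmup=2000):
        # linear warmup for the first `warmup` steps, then cosine decay to min_lr
        if step < warmup:
            return peak_lr * step / warmup
        mul = math.cos((step - warmup) / (total_steps - warmup) * math.pi) * 0.5 + 0.5
        return (peak_lr - min_lr) * mul + min_lr
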
oldie77 commented 1 year ago

I've now done a training attempt with the following setup:

Here are my training results. Do these look OK to you? They seem a bit higher than the results you posted in this thread, but I've used a fixed 0.5 ratio, and your results are probably for variable ratios (which I would expect to be a bit lower). However, I've seen you report higher PSNR results in other threads, so I wonder whether my results are "good" or "sub-average"?

(training results graph)

Here's the random upscaling code I've used:

    def scale(self, img0, gt, img1):
        # randomly upscale the triplet: 1/3 of the time keep the original size,
        # 1/3 a random factor in [1, 2], and 1/3 exactly 2x
        rnd = np.random.random()
        if rnd < 0.333:
            factor = 1.0
        elif rnd < 0.667:
            factor = np.random.random() * 1.0 + 1.0
        else:
            factor = 2.0
        img0 = cv2.resize(img0, None, fx=factor, fy=factor, interpolation=cv2.INTER_CUBIC)
        img1 = cv2.resize(img1, None, fx=factor, fy=factor, interpolation=cv2.INTER_CUBIC)
        gt   = cv2.resize(gt,   None, fx=factor, fy=factor, interpolation=cv2.INTER_CUBIC)
        return img0, gt, img1

Here are the exact Python model files I've trained with, just in case you want to look at them:

https://easyupload.io/cl2ysi

I'd like to try to improve the model. I have a few ideas what to improve. But before I start doing that, I'd like to make sure that I've maxed out the training of your original model, so I don't leave anything on the table. If I find improvements, I'll let you know.

I know that I could train for 2000 epochs instead of 300 epochs to get better results. But I wonder if there's anything else I might have missed to improve the results?

Thank you very much! :)

Edit: Using higher learning rates caused training to break down for me. At 5e-4, the teacher broke down while the rest of the training still worked; at 1e-3, everything broke down very quickly.

hzwer commented 1 year ago

Hello, I really admire your reproduction. I see no obvious problems.

I think the PSNR metric hardly reflects the actual visual quality of the model. If you're interested, I recommend testing LPIPS (AlexNet) on Vimeo90K; I got a number of 0.024. Also, the bilinear resize augmentation can improve the robustness of the model on real videos, but it degrades performance on the original validation set.
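
As a minimal sketch, LPIPS (AlexNet) can be measured with the lpips package from richzhang/PerceptualSimilarity; the tensor names below are placeholders, and the package expects inputs in [-1, 1]:

    import lpips
    import torch

    loss_fn = lpips.LPIPS(net='alex').cuda()  # AlexNet backbone, v0.1 weights

    # pred and gt: [N, 3, H, W] tensors in [0, 1], so rescale to [-1, 1] first
    with torch.no_grad():
        d = loss_fn(pred * 2 - 1, gt * 2 - 1).mean()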

For the v4.6 model, I use this random upsampling method, just for your reference.

    # with 50% probability, upscale the 256x448 Vimeo triplet to 1.5x or 2x
    if np.random.uniform(0, 1) < 0.5:
        p = np.random.choice([1.5, 2.0])
        h, w = int(256 * p), int(448 * p)
        img0 = cv2.resize(img0, (w, h), interpolation=cv2.INTER_CUBIC)
        img1 = cv2.resize(img1, (w, h), interpolation=cv2.INTER_CUBIC)
        gt = cv2.resize(gt, (w, h), interpolation=cv2.INTER_CUBIC)
oldie77 commented 1 year ago

Hey there, looks like our random upscaling code is pretty similar. I've downloaded LPIPS from here:

https://github.com/richzhang/PerceptualSimilarity

But using alex version 0.1, my LPIPS results are quite different from yours, even when testing with the official 4.6.

Anyway, I've run many tests now to try to improve your model. Unfortunately, most attempts failed, but I found one useful improvement. Here are my tests in detail:

1) Comparing training with vs. without a teacher (only L1 + VGG, no distillation), I got better results without a teacher than with one. Looking at the training graphs, my impression is that the untrained teacher slows training down at the start, and overall it ends up being worse. Do you get similar results?

2) I tried a learning rate of 5e-4. Training broke down a couple of times but recovered almost instantly; overall, the results were worse than with 2e-4, which is what I'm using now.

3) I tried replacing LeakyReLU with Mish, because in my own model it seemed to work much better. But in your model, it's worse. I'm not sure why.

4) I tried simply removing the "sigmoid" for the mask, but it was worse.

5) I tried warping in each block's resolution instead of in full image resolution, but it was worse.

6) The one thing that improved results was working with features instead of RGB channels. Basically, I run two conv layers at full image resolution with 16 channels (3x3 kernels) to extract features. I do that separately for each image, so I get 2x16 channels instead of 2x3 channels. Then, when warping and downscaling the inputs to feed them into the blocks, I warp the 2x16 feature channels instead of the 2x3 RGB channels. The only places where I use the RGB data are at the very beginning, to extract the features, and at the very end, to create the final output image. A rough sketch of the idea is below.
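
A rough, illustrative sketch of that idea (the layer widths and activation are assumptions, not my exact code):

    import torch.nn as nn

    class FeatureExtractor(nn.Module):
        def __init__(self, in_ch=3, feat_ch=16):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(in_ch, feat_ch, 3, 1, 1),
                nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(feat_ch, feat_ch, 3, 1, 1),
                nn.LeakyReLU(0.2, inplace=True),
            )

        def forward(self, img):
            return self.conv(img)  # [N, 16, H, W] features at full image resolution

    # f0, f1 = extractor(img0), extractor(img1)  # 2 x 16 channels instead of 2 x 3
    # warped_f0 = warp(f0, flow[:, :2])          # warp the features, not the RGB frames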

Here are my training results in detail:

official 4.6:
last 5% of training data: psnr: 34.03581566114612 l1: 0.052352276 vgg: 0.103010334 lpips: 0.018232213
test data:                psnr: 33.69184282428702 l1: 0.053055000 vgg: 0.108276725 lpips: 0.018533938

with teacher 300 epochs:
last 5% of training data: psnr: 34.13369765946286 l1: 0.067492900 vgg: 0.101736955 lpips: 0.018076574
test data:                psnr: 33.74516159999847 l1: 0.066869326 vgg: 0.107908260 lpips: 0.018720835

without teacher 300 epochs:
last 5% of training data: psnr: 33.78854845043220 l1: 0.070791350 vgg: 0.100611010 lpips: 0.017303353
test data:                psnr: 33.45159814784111 l1: 0.071070260 vgg: 0.106683980 lpips: 0.017895760

without teacher 150 epochs:
last 5% of training data: psnr: 33.77262763970929 l1: 0.067497850 vgg: 0.104528606 lpips: 0.019010283
test data:                psnr: 33.44365577226245 l1: 0.067496310 vgg: 0.110193080 lpips: 0.019605996

without teacher + 5e-4 learning rate 150 epochs:
last 5% of training data: psnr: 33.73466186635918 l1: 0.084015390 vgg: 0.105224010 lpips: 0.019110667
test data:                psnr: 33.41221497565082 l1: 0.082454760 vgg: 0.110920615 lpips: 0.019843696

without teacher + mish 150 epochs:
last 5% of training data: psnr: 33.72107811950534 l1: 0.066536780 vgg: 0.107533970 lpips: 0.020373814
test data:                psnr: 33.35827031394503 l1: 0.066213940 vgg: 0.113136570 lpips: 0.020898119

without teacher + features instead of RGB 150 epochs:
last 5% of training data: psnr: 33.75065112422423 l1: 0.084101520 vgg: 0.104246360 lpips: 0.018546844
test data:                psnr: 33.36340705020442 l1: 0.082589360 vgg: 0.110559830 lpips: 0.019475210

without teacher + warp in block resolution 150 epochs:
last 5% of training data: psnr: 33.31564827725716 l1: 0.087066350 vgg: 0.108045734 lpips: 0.019856920
test data:                psnr: 33.08397680394352 l1: 0.085275225 vgg: 0.114375435 lpips: 0.020565430

without teacher + no sigmoid 150 epochs:
last 5% of training data: psnr: 33.72379817367946 l1: 0.071349970 vgg: 0.105009180 lpips: 0.019180637
test data:                psnr: 33.38703426598215 l1: 0.071909570 vgg: 0.110685520 lpips: 0.019879563

Please note that I've run most of the tests with only 150 epochs, so not all the numbers are directly comparable. And I've trained with a fixed 0.5 offset, so the results are also not directly comparable to the official 4.6.

yanerzidefanbaba commented 1 year ago

Excuse me, where is the v4.6 file you discussed? I could only find v3.

oldie77 commented 1 year ago

See here:

https://github.com/hzwer/Practical-RIFE

laugam commented 1 year ago

Hi oldie77, if possible, could you share your training code for the v4.6 model? It would be really helpful. Many thanks!