victorca25 / traiNNer

traiNNer: Deep learning framework for image and video super-resolution, restoration and image-to-image translation, for training and testing.
Apache License 2.0

PPON Error when moving to Phase 2 #36

Closed: N0manDemo closed this issue 3 years ago

N0manDemo commented 3 years ago

I was training a model with PPON (192) + MultiScale + Diffaug, and I receive the following error when moving to Phase 2 (I have AMP disabled because my GPU doesn't support it). Full output from error.log:

21-01-27 11:26:52.449 - INFO: Random seed: 0
21-01-27 11:26:52.647 - INFO: Dataset [LRHRDataset - DIV2K] is created.
21-01-27 11:26:52.647 - INFO: Number of train images: 37,933, iters: 2,371
21-01-27 11:26:52.647 - INFO: Total epochs needed: 43 for iters 100,000
21-01-27 11:26:52.648 - INFO: Dataset [LRHRDataset - val_set14_part] is created.
21-01-27 11:26:52.648 - INFO: Number of val images in [val_set14_part]: 1
21-01-27 11:26:52.650 - INFO: AMP library available
21-01-27 11:26:52.827 - INFO: Initialization method [kaiming]
21-01-27 11:26:54.127 - INFO: Initialization method [kaiming]
21-01-27 11:26:54.185 - INFO: Loading pretrained model for G [../experiments/pretrained_models/PPON_G.pth] ...
21-01-27 11:26:55.276 - INFO: Network G structure: DataParallel - PPON, with parameters: 17,267,657
21-01-27 11:26:55.277 - INFO: Network D structure: DataParallel - MultiscaleDiscriminator, with parameters: 8,296,899
21-01-27 11:26:55.277 - INFO: Model [PPONModel] is created.
21-01-27 11:26:55.277 - INFO: Start training from epoch: 0, iter: 0
21-01-27 11:26:55.991 - INFO: Switching to phase: p2, step: 1
Traceback (most recent call last):
  File "/mnt/ext4-storage/Training/BasicSR/codes/train.py", line 382, in <module>
    main()
  File "/mnt/ext4-storage/Training/BasicSR/codes/train.py", line 378, in main
    fit(model, opt, dataloaders, steps_states, data_params, loggers)
  File "/mnt/ext4-storage/Training/BasicSR/codes/train.py", line 221, in fit
    model.optimize_parameters(virtual_step)  # calculate loss functions, get gradients, update network weights
  File "/mnt/ext4-storage/Training/BasicSR/codes/models/ppon_model.py", line 199, in optimize_parameters
    l_g_total.backward()
AttributeError: 'float' object has no attribute 'backward'
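What the final AttributeError suggests: when the run switches to phase p2 and no configured loss applies to that phase, the generator's total loss is never promoted from its initial value of 0 (a plain Python float) to a torch.Tensor, so calling .backward() on it fails. A minimal sketch of that failure mode; the variable name l_g_total mirrors the traceback, but the accumulation logic is an assumption for illustration, not the actual ppon_model.py code:

phase_losses = []          # suppose no losses are configured for phase p2
l_g_total = 0              # accumulator starts as a plain Python float
for loss in phase_losses:
    l_g_total += loss      # would become a torch.Tensor once a real loss is added
l_g_total.backward()       # AttributeError: 'float' object has no attribute 'backward'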

victorca25 commented 3 years ago

Hello! Can you share your options configuration file?

victorca25 commented 3 years ago

Ah, I didn't see the error.log. For PPON, you need to first configure the losses (type, weights, etc.) as you normally would, and then pick which of the losses will be used for which stage. In your case, your configuration should look something like this:

pixel_criterion: l1 
pixel_weight: 1e-2
cx_weight: 0.5
cx_type: contextual
cx_vgg_layers: {conv_3_2: 1, conv_4_2: 1}
ssim_type: ms-ssim
ssim_weight: 1
ms_criterion: multiscale-l1
ms_weight: 1e-2
gan_type: vanilla
gan_weight: 0.005
p1_losses: ['pix']
p2_losses: ['pix-multiscale', 'ms-ssim']
p3_losses: ['contextual']

As you can see, the pixel loss, multiscale pixel loss, multiscale SSIM, and contextual losses are all configured, and each phase then references a subset of them. Let me know if this fixes the problem.
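In other words, the p1_losses / p2_losses / p3_losses lists act as per-phase filters over the losses defined above, which is why every loss a phase references must first be configured with a type and weight. A rough sketch of that selection idea in Python (the names configured_losses and select_phase_losses are made up for illustration, not the actual traiNNer API):

# Hypothetical illustration of per-phase loss selection.
configured_losses = {
    'pix': 1e-2,             # pixel_criterion / pixel_weight
    'pix-multiscale': 1e-2,  # ms_criterion / ms_weight
    'ms-ssim': 1,            # ssim_type / ssim_weight
    'contextual': 0.5,       # cx_type / cx_weight
}

phase_losses = {
    'p1': ['pix'],
    'p2': ['pix-multiscale', 'ms-ssim'],
    'p3': ['contextual'],
}

def select_phase_losses(phase):
    # A phase can only use losses that were configured globally first; a
    # phase whose list names no configured loss would have nothing to
    # backpropagate, which is the error seen above.
    return {name: configured_losses[name] for name in phase_losses[phase]}

print(select_phase_losses('p2'))  # {'pix-multiscale': 0.01, 'ms-ssim': 1}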

N0manDemo commented 3 years ago

Thank you, that fixed the problem. I was missing quite a few options from the list.