visinf / irr

Iterative Residual Refinement for Joint Optical Flow and Occlusion Estimation (CVPR 2019)
Apache License 2.0

Gradient explosion after correcting losses.py #37

Closed ghost closed 3 years ago

ghost commented 3 years ago

Hi authors of irr, I discovered a gradient explosion when training IRR_PWC after I pulled the update to losses.py intended to fix this issue.

I trained several times, and the gradient explosion occurred every time. Below is the logbook.txt from one training run.

========================================== logbook.txt ==============================================

[2021-04-27 19:28:49] ==> Commandline Arguments
[2021-04-27 19:28:49] batch_size: 4
[2021-04-27 19:28:49] batch_size_val: 4
[2021-04-27 19:28:49] checkpoint: None
[2021-04-27 19:28:49] cuda: True
[2021-04-27 19:28:49] evaluation: False
[2021-04-27 19:28:49] lr_scheduler: MultiStepLR
[2021-04-27 19:28:49] lr_scheduler_gamma: 0.5
[2021-04-27 19:28:49] lr_scheduler_last_epoch: -1
[2021-04-27 19:28:49] lr_scheduler_milestones: [54, 72, 90]
[2021-04-27 19:28:49] model: IRR_PWC
[2021-04-27 19:28:49] model_div_flow: 0.05
[2021-04-27 19:28:49] name: run
[2021-04-27 19:28:49] num_iters: 1
[2021-04-27 19:28:49] num_workers: 4
[2021-04-27 19:28:49] optimizer: Adam
[2021-04-27 19:28:49] optimizer_amsgrad: False
[2021-04-27 19:28:49] optimizer_betas: (0.9, 0.999)
[2021-04-27 19:28:49] optimizer_eps: 1e-08
[2021-04-27 19:28:49] optimizer_group: None
[2021-04-27 19:28:49] optimizer_lr: 0.0001
[2021-04-27 19:28:49] optimizer_weight_decay: 0.0004
[2021-04-27 19:28:49] save: experiments/IRR_PWC-20210427-192845
[2021-04-27 19:28:49] save_result_bidirection: False
[2021-04-27 19:28:49] save_result_flo: False
[2021-04-27 19:28:49] save_result_img: False
[2021-04-27 19:28:49] save_result_occ: False
[2021-04-27 19:28:49] save_result_path_name:
[2021-04-27 19:28:49] save_result_png: False
[2021-04-27 19:28:49] seed: 1
[2021-04-27 19:28:49] start_epoch: 1
[2021-04-27 19:28:49] total_epochs: 108
[2021-04-27 19:28:49] training_augmentation: RandomAffineFlowOcc
[2021-04-27 19:28:49] training_augmentation_addnoise: True
[2021-04-27 19:28:49] training_augmentation_crop: None
[2021-04-27 19:28:49] training_dataset: FlyingChairsOccTrain
[2021-04-27 19:28:49] training_dataset_photometric_augmentations: True
[2021-04-27 19:28:49] training_dataset_root: /mnt/lustre/share_data/longzeren/FlyingChairsOcc/data
[2021-04-27 19:28:49] training_key: total_loss
[2021-04-27 19:28:49] training_loss: MultiScaleEPE_PWC_Bi_Occ_upsample
[2021-04-27 19:28:49] validation_augmentation: None
[2021-04-27 19:28:49] validation_dataset: FlyingChairsOccValid
[2021-04-27 19:28:49] validation_dataset_photometric_augmentations: False
[2021-04-27 19:28:49] validation_dataset_root: /mnt/lustre/share_data/longzeren/FlyingChairsOcc/data
[2021-04-27 19:28:49] validation_key: epe
[2021-04-27 19:28:49] validation_key_minimize: True
[2021-04-27 19:28:49] validation_loss: MultiScaleEPE_PWC_Bi_Occ_upsample
[2021-04-27 19:28:49] ==> Random Seeds
[2021-04-27 19:28:49] Python seed: 1
[2021-04-27 19:28:49] Numpy seed: 2
[2021-04-27 19:28:49] Torch CPU seed: 3
[2021-04-27 19:28:49] Torch CUDA seed: 4
[2021-04-27 19:28:49] ==> Datasets
[2021-04-27 19:28:51] Training Dataset: FlyingChairsOccTrain
[2021-04-27 19:28:53] input1: [3, 384, 512]
[2021-04-27 19:28:53] input2: [3, 384, 512]
[2021-04-27 19:28:53] target1: [2, 384, 512]
[2021-04-27 19:28:53] target2: [2, 384, 512]
[2021-04-27 19:28:53] target_occ1: [1, 384, 512]
[2021-04-27 19:28:53] target_occ2: [1, 384, 512]
[2021-04-27 19:28:53] num_examples: 22232
[2021-04-27 19:28:54] Validation Dataset: FlyingChairsOccValid
[2021-04-27 19:28:54] input1: [3, 384, 512]
[2021-04-27 19:28:54] input2: [3, 384, 512]
[2021-04-27 19:28:54] target1: [2, 384, 512]
[2021-04-27 19:28:54] target2: [2, 384, 512]
[2021-04-27 19:28:54] target_occ1: [1, 384, 512]
[2021-04-27 19:28:54] target_occ2: [1, 384, 512]
[2021-04-27 19:28:54] num_examples: 640
[2021-04-27 19:28:54] ==> Runtime Augmentations
[2021-04-27 19:28:54] training_augmentation: RandomAffineFlowOcc
[2021-04-27 19:28:54] addnoise: True
[2021-04-27 19:28:54] crop: None
[2021-04-27 19:28:59] validation_augmentation: None
[2021-04-27 19:28:59] ==> Model and Loss
[2021-04-27 19:28:59] Initializing MSRA
[2021-04-27 19:29:00] Batch Size: 4
[2021-04-27 19:29:00] GPGPU: Cuda
[2021-04-27 19:29:00] Network: IRR_PWC
[2021-04-27 19:29:00] Number of parameters: 6362092
[2021-04-27 19:29:00] Training Key: total_loss
[2021-04-27 19:29:00] Training Loss: MultiScaleEPE_PWC_Bi_Occ_upsample
[2021-04-27 19:29:00] Validation Key: epe
[2021-04-27 19:29:00] Validation Loss: MultiScaleEPE_PWC_Bi_Occ_upsample
[2021-04-27 19:29:00] ==> Checkpoint
[2021-04-27 19:29:00] No checkpoint given.
[2021-04-27 19:29:00] Starting from scratch with random initialization.
[2021-04-27 19:29:00] ==> Save Directory
[2021-04-27 19:29:00] Save directory: experiments/IRR_PWC-20210427-192845
[2021-04-27 19:29:00] ==> Optimizer
[2021-04-27 19:29:00] Adam
[2021-04-27 19:29:00] amsgrad: False
[2021-04-27 19:29:00] betas: (0.9, 0.999)
[2021-04-27 19:29:00] eps: 1e-08
[2021-04-27 19:29:00] lr: 0.0001
[2021-04-27 19:29:00] weight_decay: 0.0004
[2021-04-27 19:29:00] ==> Learning Rate Scheduler
[2021-04-27 19:29:00] class: MultiStepLR
[2021-04-27 19:29:00] gamma: 0.5
[2021-04-27 19:29:00] last_epoch: -1
[2021-04-27 19:29:00] milestones: [54, 72, 90]
[2021-04-27 19:29:00] ==> Runtime
[2021-04-27 19:29:00] start_epoch: 1
[2021-04-27 19:29:00] total_epochs: 108
[2021-04-27 19:29:00]
[2021-04-27 19:29:00] ==> Epoch 1/108
[2021-04-27 19:29:00] lr: 0.0001
[2021-04-27 21:00:11] ==> Train: 100%|##########| 5558/5558 1:31:11<00:00 1.02it/s flow_loss_ema=192.9533, occ_loss_ema=52.6580, total_loss_ema=386.1001
[2021-04-27 21:01:06] ==> Validate: 100%|##########| 160/160 00:54<00:00 2.93it/s F1_avg=0.1559, epe_avg=70.1332
[2021-04-27 21:01:06] ==> Progress: 0%| | 0/108 1:32:06<? ?s/ep best_epe_avg=70.1332
[2021-04-27 21:01:06] Saved checkpoint as best model..
[2021-04-27 21:01:06]
[2021-04-27 21:01:06] ==> Epoch 2/108
[2021-04-27 21:01:06] lr: 0.0001
[2021-04-27 21:53:11] ==> Train: 57%|#####7 | 3175/5558 52:05<39:05 1.02it/s flow_loss_ema=13349111.8987, occ_loss_ema=52.0983, total_loss_ema=26698223.7973
[2021-04-27 21:53:12] ==> Progress: 1%| | 1/108 2:24:12<257:09:48 8652.23s/ep best_epe_avg=70.1332

=================================================================================================

My software environment is as follows: Python 3.7.6, torch 1.5.0, CUDA v9.0.176.

My hardware environment is as follows: CPU: Intel(R) Xeon(R) E5-2620 v4 @ 2.10GHz; GPU: GeForce GTX 1080.
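
In the meantime, as a stopgap I am logging the global gradient norm at every step and clipping it, so that a single bad batch cannot derail the whole run. This is only a minimal sketch with hypothetical placeholder names (`train_one_epoch`, `loss_fn`, `loader`), not the repo's actual training loop:

```python
import torch
import torch.nn as nn

# Minimal sketch, NOT the repo's training loop: log the global gradient
# norm each step and clip it. All names below are placeholders.
def train_one_epoch(model, optimizer, loss_fn, loader, max_grad_norm=10.0):
    model.train()
    for step, (inputs, targets) in enumerate(loader):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        # clip_grad_norm_ returns the total norm measured BEFORE clipping,
        # so it doubles as an explosion detector.
        grad_norm = float(torch.nn.utils.clip_grad_norm_(
            model.parameters(), max_grad_norm))
        if not torch.isfinite(loss).item() or grad_norm > max_grad_norm:
            print(f"step {step}: loss={loss.item():.4f}, grad_norm={grad_norm:.2f}")
        optimizer.step()

# Toy usage on a dummy regression problem:
model = nn.Linear(8, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loader = [(torch.randn(4, 8), torch.randn(4, 2)) for _ in range(10)]
train_one_epoch(model, optimizer, nn.MSELoss(), loader)
```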

Any help would be appreciated.

hurjunhwa commented 3 years ago

I will have a look at it!

hurjunhwa commented 3 years ago

Hi, I checked and it works properly on both PyTorch 0.4.1 (the original implementation setting) and PyTorch 1.5.0. Have you changed anything other than the loss function?

ghost commented 3 years ago

Thanks for your quick reply. I will try again on a different machine.

hurjunhwa commented 3 years ago

I have just revised the source code so that it is now compatible with PyTorch 1.5.0. You can try pulling the most recent commit and running it!
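
If it still diverges after pulling, PyTorch's autograd anomaly detection may help localize the problem: it makes backward() raise at the first operation whose gradient contains NaN and prints the traceback of the forward op that created it. It is slow, so enable it only while debugging. A minimal self-contained sketch (the 0/0 below is just a stand-in for whatever actually produces the NaN):

```python
import torch

# Debug-only: raise at the first backward op that returns NaN gradients.
torch.autograd.set_detect_anomaly(True)

x = torch.zeros(1, requires_grad=True)
y = x / x          # 0/0 -> NaN in the forward pass, stand-in for the real bug
try:
    y.sum().backward()
except RuntimeError as err:
    print("anomaly detected:", err)
```

In a real training script you could instead wrap just the backward pass in `with torch.autograd.detect_anomaly(): loss.backward()`.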

ghost commented 3 years ago

Thank you so much. I tried again yesterday evening on a V100 machine with the previous code; however, the gradient explosion still occurred.

I will take a look at the updated code immediately!

ghost commented 3 years ago

It seems that the gradients no longer explode when training IRR-PWC after I updated the code!

Thank you so much for your help!