wangyx240 / High-Resolution-Image-Inpainting-GAN

PyTorch re-implementation of "Contextual Residual Aggregation for Ultra High-Resolution Image Inpainting" (CVPR 2020 Oral)

This code is not runnable unless first_MaskL1Loss is recomputed after backpropagating first_MaskL1Loss #7

Open · yingqichao opened 3 years ago

yingqichao commented 3 years ago

In trainer.py, Line 170, there is:

loss = 0.5*opt.lambda_l1 * first_MaskL1Loss + opt.lambda_l1 * second_MaskL1Loss + GAN_Loss + second_PerceptualLoss * opt.lambda_perceptual

If you run this program without modifying anything, the line above causes:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 32, 3, 3]] is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
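For context, the same failure mode can be reproduced outside this repo. Below is a minimal sketch, where `g1` and `g2` are hypothetical stand-ins for the coarse and refinement stages (not the repo's actual modules), and the toy losses stand in for first_MaskL1Loss and the second-stage losses:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the two generator stages (not the repo's modules)
g1 = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))  # "coarse" stage
g2 = nn.Linear(8, 8)                                  # "refinement" stage
optimizer_g1 = torch.optim.SGD(g1.parameters(), lr=0.1)
optimizer_g = torch.optim.SGD(list(g1.parameters()) + list(g2.parameters()), lr=0.1)

x = torch.randn(4, 8)
first_out = g1(x)
second_out = g2(first_out)
first_loss = first_out.pow(2).mean()    # stands in for first_MaskL1Loss
second_loss = second_out.pow(2).mean()  # stands in for the second-stage losses

optimizer_g1.zero_grad()
first_loss.backward(retain_graph=True)
optimizer_g1.step()  # in-place parameter update bumps saved tensors' version counters

loss = 0.5 * first_loss + second_loss
optimizer_g.zero_grad()
loss.backward()  # RuntimeError: one of the variables needed for gradient computation ...
```

The second backward re-traverses the retained graph, but optimizer_g1.step() has already modified, in place, parameters that were saved inside that graph for backward, which is exactly what the version-mismatch message complains about.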

Actually, I have run into this annoying bug several times with my setup (PyTorch 1.8, CUDA 11.2, NVIDIA RTX 3090). I suppose that after calling "first_MaskL1Loss.backward(retain_graph)" and "optimizer_g1.step()", first_MaskL1Loss cannot be reused in the total loss above. I don't know the exact reason, though presumably optimizer_g1.step() updates the generator's parameters in place and thereby invalidates tensors saved in the retained graph. In any case, I managed to run the code error-free with the following modification:

    # Generator output
    for repeated_idx in range(2):  # added: pass 0 backprops first_MaskL1Loss; pass 1 recomputes the graph and skips it
        first_out, second_out = generator(img, mask)
        ...
        ...
        if repeated_idx == 0:
            optimizer_g1.zero_grad()
            first_MaskL1Loss.backward()  # retain_graph=True no longer needed
            optimizer_g1.step()

    # the rest is not modified (note it sits outside the loop, so it uses the pass-1 graph)
    optimizer_g.zero_grad()
    # Get the deep semantic feature maps, and compute Perceptual Loss
    img_featuremaps = perceptualnet(img)  # feature maps

Another benefit of this modification is that retain_graph=True is no longer required, which saves GPU memory.
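Continuing the toy setup from the sketch above (run it without the failing backward), the recompute pattern looks like this; each backward consumes its own freshly built graph, so retain_graph=True is never needed:

```python
# Pass 0: backprop the first-stage loss, then step the first-stage optimizer.
first_out = g1(x)
first_loss = first_out.pow(2).mean()
optimizer_g1.zero_grad()
first_loss.backward()  # no retain_graph=True: graph buffers are freed right here
optimizer_g1.step()

# Pass 1: a fresh forward pass builds a new graph against the *updated* g1 parameters.
first_out = g1(x)
second_out = g2(first_out)
loss = 0.5 * first_out.pow(2).mean() + second_out.pow(2).mean()
optimizer_g.zero_grad()
loss.backward()  # runs cleanly: no saved tensor predates the optimizer step
optimizer_g.step()
```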

Another issue: this project also seems to error out when run with DDP (DistributedDataParallel).

YingJiacheng commented 1 year ago

You are awesome!!! Thanks a lot!