irfanICMLL / structure_knowledge_distillation

The official code for the paper 'Structured Knowledge Distillation for Semantic Segmentation' (CVPR 2019 oral) and its extension to other tasks.
BSD 2-Clause "Simplified" License

pixel-wise loss dominates generator training #34

Closed ArsenLuca closed 4 years ago

ArsenLuca commented 4 years ago

As the paper states, lambda_pi = 10, so the pixel-wise loss dominates during generator training. Is that OK?

ArsenLuca commented 4 years ago

Besides, in the code:

        if args.pi == True:
            temp = args.lambda_pi*self.criterion_pixel_wise(self.preds_S, self.preds_T, is_target_scattered = True)
            self.pi_G_loss = temp.item()
            G_loss = G_loss + temp
        if args.pa == True:
            #for ind in range(len(args.lambda_pa)):
            #    if args.lambda_pa[ind] != 0.0:
            #        temp1 = self.criterion_pair_wise_for_interfeat[ind](self.preds_S, self.preds_T, is_target_scattered = True)
            #        self.pa_G_loss[ind] = temp1.item()
            #        G_loss = G_loss + args.lambda_pa[ind]*temp1
            #    elif args.lambda_pa[ind] == 0.0:
            #        self.pa_G_loss[ind] = 0.0
            temp1 = self.criterion_pair_wise_for_interfeat(self.preds_S, self.preds_T, is_target_scattered = True)
            self.pa_G_loss = temp1.item()
            G_loss = G_loss + args.lambda_pa*temp1

It seems that the weight of the pixel-wise loss is args.lambda_pi (10) while the weight of the pair-wise loss is args.lambda_pa (0.5), whereas in the paper both terms share the same weight lambda1 (10) in Eq. 4. @irfanICMLL
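
For reference, here is a minimal sketch of the two weightings being compared (the adversarial term and the actual loss functions are omitted; the weight values are the ones quoted in this thread, and the function names are purely illustrative, not from the repository):

    def g_loss_paper_eq4(ce, pi, pa, lambda1=10.0):
        # Paper, Eq. 4: pixel-wise and pair-wise distillation share one weight lambda1.
        return ce + lambda1 * (pi + pa)

    def g_loss_training_script(ce, pi, pa, lambda_pi=10.0, lambda_pa=0.5):
        # Training script: separate weights lambda_pi and lambda_pa, as in the snippet quoted above.
        return ce + lambda_pi * pi + lambda_pa * pa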

yuhuan-wu commented 4 years ago

Besides what you mentioned, I also found a strange setting for the pixel distillation loss. https://github.com/irfanICMLL/structure_knowledge_distillation/blob/ce208e1e5ba9177ecfc42519a2c64148d396fb71/utils/criterion.py#L225 The pixel distillation loss is not divided by the batch size N, while the other losses are divided by N. Combined with what @ArsenLuca mentioned, the pixel distillation loss effectively carries a large weight of **lambda_pi * N**, which should be why its value dominates the total loss.
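
For comparison, here is a minimal sketch of a pixel-wise distillation loss that is also averaged over the batch dimension. It assumes the common KL-divergence formulation with a softmax temperature; it is not the repository's exact implementation in criterion.py:

    import torch
    import torch.nn.functional as F

    def pixel_wise_distillation(preds_S, preds_T, temperature=1.0):
        # preds_S, preds_T: raw logits of shape (N, C, H, W) from the student and teacher.
        p_T = F.softmax(preds_T.detach() / temperature, dim=1)
        log_p_S = F.log_softmax(preds_S / temperature, dim=1)
        # KL divergence per pixel (summed over the class dimension)...
        kl = (p_T * (torch.log(p_T + 1e-12) - log_p_S)).sum(dim=1)
        # ...then averaged over all N*H*W pixels, so the result is divided by N as well.
        return kl.mean()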

irfanICMLL commented 4 years ago

The pixel-wise loss does not dominate generator training. If you remove the GAN and the pa loss, the results become worse. Its value is very large because of the newly reproduced version, but the main idea of using structural knowledge in knowledge distillation still works in this setting. Varying the loss weight does affect the results, but not by much.