Besides, as shown in the code:
```python
if args.pi == True:
    temp = args.lambda_pi*self.criterion_pixel_wise(self.preds_S, self.preds_T, is_target_scattered = True)
    self.pi_G_loss = temp.item()
    G_loss = G_loss + temp
if args.pa == True:
    #for ind in range(len(args.lambda_pa)):
    #    if args.lambda_pa[ind] != 0.0:
    #        temp1 = self.criterion_pair_wise_for_interfeat[ind](self.preds_S, self.preds_T, is_target_scattered = True)
    #        self.pa_G_loss[ind] = temp1.item()
    #        G_loss = G_loss + args.lambda_pa[ind]*temp1
    #    elif args.lambda_pa[ind] == 0.0:
    #        self.pa_G_loss[ind] = 0.0
    temp1 = self.criterion_pair_wise_for_interfeat(self.preds_S, self.preds_T, is_target_scattered = True)
    self.pa_G_loss = temp1.item()
    G_loss = G_loss + args.lambda_pa*temp1
```
It seems that the weight of the pixel-wise loss is args.lambda_pi (10) while the weight of the pair-wise loss is args.lambda_pa (0.5), whereas in the paper they share the same weight lambda_1 (10) in Eq. 4. @irfanICMLL
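For reference, my reading of the weighting in Eq. 4 is sketched below; the variable names are mine for illustration, not the repo's:

```python
# Illustrative restatement of the weighting in Eq. 4 as I read it
# (ce_loss / pi_loss / pa_loss / adv_term are placeholder names, not repo variables):
G_loss = ce_loss + lambda_1 * (pi_loss + pa_loss) + adv_term  # lambda_1 = 10 applied to BOTH pi and pa
```

So in the paper the pixel-wise and pair-wise terms share one weight, while the code above uses two different ones.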
Besides what you mentioned about the pixel distillation loss, I also found a strange setting for it. https://github.com/irfanICMLL/structure_knowledge_distillation/blob/ce208e1e5ba9177ecfc42519a2c64148d396fb71/utils/criterion.py#L225 The pixel distillation loss is not divided by the batch size N, while the other losses are. Combined with what @ArsenLuca mentioned, the pixel distillation loss effectively carries a large weight of **lambda_pi \* N**, which should be why its value dominates the total loss.
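For comparison, here is a minimal sketch (my own, not the repo's criterion) of a pixel-wise distillation loss that is averaged over all pixels, so its scale does not grow with the batch size N:

```python
import torch.nn.functional as F

def pixel_wise_kd_loss(preds_S, preds_T, T=1.0):
    """KL divergence between per-pixel class distributions of student and teacher,
    averaged over all N*H*W pixels so the value does not scale with N."""
    N, C, H, W = preds_S.shape
    # Treat every pixel as one sample: flatten to (N*H*W, C).
    log_p_S = F.log_softmax(preds_S.permute(0, 2, 3, 1).reshape(-1, C) / T, dim=1)
    p_T = F.softmax(preds_T.permute(0, 2, 3, 1).reshape(-1, C) / T, dim=1)
    # reduction='batchmean' divides the summed KL by the number of rows (N*H*W).
    return F.kl_div(log_p_S, p_T.detach(), reduction='batchmean') * (T * T)
```

With an averaging like this, changing the batch size would not silently rescale the effective lambda_pi.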
The pixel-wise loss does not dominate generator training. If you remove the GAN and the pa loss, the results become worse. Its value is very large in the new reproduced version. However, the main idea of using structure knowledge in knowledge distillation still works in this setting. We varied the loss weight; it affects the results, but not by much.
As the paper states, lambda_pi = 10, so the pixel-wise loss dominates during the generator training process. Is that OK?