irfanICMLL / structure_knowledge_distillation

The official code for the paper "Structured Knowledge Distillation for Semantic Segmentation" (CVPR 2019 oral) and extensions to other tasks.
BSD 2-Clause "Simplified" License

Pixel-wise loss #45

Closed duanxuesong closed 3 years ago

duanxuesong commented 3 years ago

I noticed this part of the code: `loss = (torch.sum(-softmax_pred_T * logsoftmax(preds_S[0].permute(0,2,3,1).contiguous().view(-1,C)))) / W / H`. Is this a representation of KL divergence?
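For reference, here is a minimal, self-contained sketch of what that line appears to compute (tensor shapes and variable names are my assumptions, not the repository's exact code):

```python
import torch
import torch.nn.functional as F

def pixel_wise_distillation_loss(preds_S, preds_T):
    """Soft-target cross-entropy between teacher and student logits of shape (N, C, H, W)."""
    N, C, H, W = preds_S.shape
    # Teacher class probabilities act as fixed soft targets (no gradient flows to the teacher).
    softmax_pred_T = F.softmax(
        preds_T.permute(0, 2, 3, 1).reshape(-1, C), dim=1).detach()
    # Student log-probabilities at every pixel.
    log_softmax_pred_S = F.log_softmax(
        preds_S.permute(0, 2, 3, 1).reshape(-1, C), dim=1)
    # Sum of -p_T * log p_S over all pixels and classes, normalized by the spatial size.
    return torch.sum(-softmax_pred_T * log_softmax_pred_S) / W / H
```

With one-hot teacher probabilities this reduces to the usual per-pixel cross-entropy; with soft teacher probabilities it matches the KL divergence up to a constant, as discussed later in this thread.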

qiuhaining commented 3 years ago

> I noticed this part of the code: `loss = (torch.sum(-softmax_pred_T * logsoftmax(preds_S[0].permute(0,2,3,1).contiguous().view(-1,C)))) / W / H`. Is this a representation of KL divergence?

I think it is the CE (cross-entropy) loss.

aye0804 commented 3 years ago

> I think it is the CE (cross-entropy) loss.

But in the paper, the authors say they 'use the class probabilities produced from the teacher model as soft targets for training the compact network'.

qiuhaining commented 3 years ago

> But in the paper, the authors say they 'use the class probabilities produced from the teacher model as soft targets for training the compact network'.

Yes, it is! I think there are two ways to train the student model: if the teacher model is also trained while training the compact network, the pixel-wise loss is the KL loss; otherwise, it is the CE loss.

aye0804 commented 3 years ago

> Yes, it is! I think there are two ways to train the student model: if the teacher model is also trained while training the compact network, the pixel-wise loss is the KL loss; otherwise, it is the CE loss.

What you said makes sense.

chenwang1701 commented 3 years ago

Well, the loss is actually the CE loss. The relation is KL(T‖S) = CE(T, S) − Entropy(T). Since Entropy(T) does not depend on the student's parameters, optimizing the student with the KL(T‖S) loss is equivalent to optimizing it with the CE(T, S) loss.
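A small sketch (my own check, not code from this repository) illustrating that point: the gradients of CE(T, S) and KL(T‖S) with respect to the student logits coincide, because Entropy(T) is constant for the student.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C = 19  # e.g. number of Cityscapes classes
student_logits = torch.randn(4, C, requires_grad=True)
teacher_logits = torch.randn(4, C)

p_T = F.softmax(teacher_logits, dim=1)          # teacher soft targets (constant w.r.t. the student)
log_p_S = F.log_softmax(student_logits, dim=1)  # student log-probabilities

# Cross-entropy CE(T, S) = sum(-p_T * log p_S), averaged over the batch.
ce = torch.sum(-p_T * log_p_S) / student_logits.shape[0]
# KL(T || S) = CE(T, S) - Entropy(T); Entropy(T) is constant, so the student gradients match CE.
kl = F.kl_div(log_p_S, p_T, reduction='batchmean')

grad_ce = torch.autograd.grad(ce, student_logits, retain_graph=True)[0]
grad_kl = torch.autograd.grad(kl, student_logits)[0]
print(torch.allclose(grad_ce, grad_kl, atol=1e-6))  # True: identical gradients for the student
```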