Closed duanxuesong closed 3 years ago
I noticed this part of the code: "loss = (torch.sum(-softmax_pred_T * logsoftmax(preds_S[0].permute(0,2,3,1).contiguous().view(-1,C))))/W/H". Is this a representation of KL divergence?
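For readability, here is roughly what I think that line expands to (my own cleaned-up sketch, not the repo's code; the wrapper function, the preds_T argument, and the softmax over the teacher logits are assumptions based on the variable names):

```python
import torch
import torch.nn as nn

def pixel_wise_distill_loss(preds_S, preds_T):
    # assumed: preds_S[0] and preds_T[0] are logits of shape (N, C, H, W)
    N, C, H, W = preds_S[0].shape
    logsoftmax = nn.LogSoftmax(dim=1)

    # teacher class probabilities per pixel, flattened to (N*H*W, C)
    softmax_pred_T = nn.Softmax(dim=1)(
        preds_T[0].permute(0, 2, 3, 1).contiguous().view(-1, C))

    # the quoted line: soft-target cross entropy between teacher probabilities
    # and student log-probabilities, summed over all pixels and classes,
    # then divided by W*H
    loss = (torch.sum(-softmax_pred_T *
                      logsoftmax(preds_S[0].permute(0, 2, 3, 1).contiguous().view(-1, C)))) / W / H
    return loss
```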
I think it is CE loss.
But in the paper, the authors say they 'use the class probabilities produced from the teacher model as soft targets for training the compact network'.
Yes, it is! I think there are two ways to train the student model: when the compact network is trained together with the teacher model, the pixel-wise loss is the KL loss; otherwise, it is the CE loss.
What you said makes sense.
Well, the loss is actually the CE loss. The relation is CE(T,S) = Entropy(T) + KL(T,S). Since the teacher distribution T is fixed, Entropy(T) is a constant with respect to the student's parameters, so optimizing the student with the KL(T,S) loss is equivalent to optimizing it with the CE(T,S) loss.
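A quick toy check (random logits, not the repo's code) confirms that the two losses differ only by Entropy(T) and therefore give the same gradients for the student:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# toy logits: teacher is fixed, student requires gradients
t_logits = torch.randn(4, 5)
s_logits = torch.randn(4, 5, requires_grad=True)

p_t = F.softmax(t_logits, dim=1)            # teacher probabilities (soft targets)
log_p_s = F.log_softmax(s_logits, dim=1)    # student log-probabilities

ce = -(p_t * log_p_s).sum(dim=1).mean()                     # CE(T, S)
kl = (p_t * (torch.log(p_t) - log_p_s)).sum(dim=1).mean()   # KL(T || S)
entropy_t = -(p_t * torch.log(p_t)).sum(dim=1).mean()       # Entropy(T), constant w.r.t. student

# CE(T,S) = Entropy(T) + KL(T,S), so the two losses differ by a constant
print(torch.allclose(ce, entropy_t + kl))   # True

# and their gradients w.r.t. the student logits are identical
g_ce = torch.autograd.grad(ce, s_logits, retain_graph=True)[0]
g_kl = torch.autograd.grad(kl, s_logits)[0]
print(torch.allclose(g_ce, g_kl, atol=1e-6))  # True
```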