microsoft / DeBERTa

The implementation of DeBERTa

Question regarding symmetric KL Loss #145

Open skbaur opened 7 months ago

skbaur commented 7 months ago

The symmetric KL loss implemented here (for the sift loss) https://github.com/microsoft/DeBERTa/blob/4d7fe0bd4fb3c7d4f4005a7cafabde9800372098/DeBERTa/sift/sift.py#L180 differs from the symmetrized Kullback–Leibler divergence (see https://en.wikipedia.org/wiki/Kullback–Leibler_divergence#Symmetrised_divergence). In particular, it is not zero when both inputs are equal, as one would expect from the name. In fact, for equal inputs it appears to evaluate to twice the entropy of the predicted distribution, which would intuitively push the model toward higher-confidence predictions.
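For concreteness, here is a minimal sketch (my own illustration, not the repository code) of a loss that sums the two cross-entropies H(q, p) + H(p, q), which matches the behaviour described above: for identical inputs it evaluates to twice the entropy rather than to zero.

```python
import torch
import torch.nn.functional as F

def symmetric_cross_entropy(logits, target_logits):
    """Illustrative sketch only (not the code from sift.py):
    sums the two cross-entropies H(q, p) + H(p, q) of the softmax
    distributions p = softmax(logits), q = softmax(target_logits)."""
    logprob_p = F.log_softmax(logits.float(), dim=-1)
    logprob_q = F.log_softmax(target_logits.float(), dim=-1)
    prob_p = logprob_p.exp().detach()
    prob_q = logprob_q.exp().detach()
    forward = (prob_q * (-logprob_p)).sum(dim=-1)   # H(q, p)
    backward = (prob_p * (-logprob_q)).sum(dim=-1)  # H(p, q)
    return (forward + backward).mean()

logits = torch.tensor([[2.0, 0.5, -1.0]])
print(symmetric_cross_entropy(logits, logits))  # > 0 even for identical inputs

p = F.softmax(logits, dim=-1)
print(2 * -(p * p.log()).sum())  # twice the entropy: same value as above
```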

Other implementations, see e.g. https://github.com/archinetai/smart-pytorch/blob/e96d8630dc58e1dce8540f61f91016849925ebfe/smart_pytorch/loss.py#L10, behave more like I would have expected given the name. Is there a reason to deviate from the more standard definition?
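For reference, a minimal sketch of the standard symmetrized (Jeffreys) divergence KL(p || q) + KL(q || p) in the sense of the Wikipedia article above, which is zero for identical inputs:

```python
import torch
import torch.nn.functional as F

def symmetrized_kl(logits, target_logits):
    """Illustrative sketch of the standard symmetrized KL divergence
    KL(p || q) + KL(q || p); zero when both inputs are equal."""
    logprob_p = F.log_softmax(logits.float(), dim=-1)
    logprob_q = F.log_softmax(target_logits.float(), dim=-1)
    prob_p = logprob_p.exp()
    prob_q = logprob_q.exp()
    kl_pq = (prob_p * (logprob_p - logprob_q)).sum(dim=-1)  # KL(p || q)
    kl_qp = (prob_q * (logprob_q - logprob_p)).sum(dim=-1)  # KL(q || p)
    return (kl_pq + kl_qp).mean()

logits = torch.tensor([[2.0, 0.5, -1.0]])
print(symmetrized_kl(logits, logits))  # tensor(0.) for identical inputs
```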