microsoft / DeBERTa

The implementation of DeBERTa

Question regarding symmetric KL Loss #145

Open skbaur opened 7 months ago

skbaur commented 7 months ago

The symmetric KL loss implemented here (for the sift loss) https://github.com/microsoft/DeBERTa/blob/4d7fe0bd4fb3c7d4f4005a7cafabde9800372098/DeBERTa/sift/sift.py#L180 differs from the symmetrized Kullback–Leibler divergence (see https://en.wikipedia.org/wiki/Kullback–Leibler_divergence#Symmetrised_divergence). In particular, it is not zero when both inputs are equal, as one would expect from the name. In fact, for equal inputs it appears to evaluate to twice the entropy of the predicted distribution, which would intuitively push the model toward higher-confidence predictions.
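For concreteness, here is a minimal sketch (my own illustration, not the repository code) of a loss that sums the two cross-entropies H(q, p) + H(p, q), which matches the behaviour described above: for identical inputs it evaluates to twice the entropy rather than to zero.

```python
import torch
import torch.nn.functional as F

def symmetric_cross_entropy(logits, target_logits):
    """Illustrative sketch only (not the code from sift.py):
    sums the two cross-entropies H(q, p) + H(p, q) of the softmax
    distributions p = softmax(logits), q = softmax(target_logits)."""
    logprob_p = F.log_softmax(logits.float(), dim=-1)
    logprob_q = F.log_softmax(target_logits.float(), dim=-1)
    prob_p = logprob_p.exp().detach()
    prob_q = logprob_q.exp().detach()
    forward = (prob_q * (-logprob_p)).sum(dim=-1)   # H(q, p)
    backward = (prob_p * (-logprob_q)).sum(dim=-1)  # H(p, q)
    return (forward + backward).mean()

logits = torch.tensor([[2.0, 0.5, -1.0]])
print(symmetric_cross_entropy(logits, logits))  # > 0 even for identical inputs

p = F.softmax(logits, dim=-1)
print(2 * -(p * p.log()).sum())  # twice the entropy: same value as above
```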

Other implementations, see e.g. https://github.com/archinetai/smart-pytorch/blob/e96d8630dc58e1dce8540f61f91016849925ebfe/smart_pytorch/loss.py#L10, behave more like I would have expected given the name. Is there a reason to deviate from the more standard definition?
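For reference, a minimal sketch of the standard symmetrized (Jeffreys) divergence KL(p || q) + KL(q || p) in the sense of the Wikipedia article above, which is zero for identical inputs:

```python
import torch
import torch.nn.functional as F

def symmetrized_kl(logits, target_logits):
    """Illustrative sketch of the standard symmetrized KL divergence
    KL(p || q) + KL(q || p); zero when both inputs are equal."""
    logprob_p = F.log_softmax(logits.float(), dim=-1)
    logprob_q = F.log_softmax(target_logits.float(), dim=-1)
    prob_p = logprob_p.exp()
    prob_q = logprob_q.exp()
    kl_pq = (prob_p * (logprob_p - logprob_q)).sum(dim=-1)  # KL(p || q)
    kl_qp = (prob_q * (logprob_q - logprob_p)).sum(dim=-1)  # KL(q || p)
    return (kl_pq + kl_qp).mean()

logits = torch.tensor([[2.0, 0.5, -1.0]])
print(symmetrized_kl(logits, logits))  # tensor(0.) for identical inputs
```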