pierrefournier752 opened this issue 1 year ago
Hi @pierrefournier752,
Thanks for reaching out! Not a naive question at all, it's a great one. Here are some thoughts:
I don't think there is a single "right" way to formulate a KL loss. Both options (forward and reverse KL) seem valid to me, and they simply differ (from a learning perspective) in the gradients they yield. I can point you to an interesting discussion called "on the choice of KL divergence" from a Stanford class at https://ermongroup.github.io/cs228-notes/inference/variational/ in which both are presented.
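To make the difference concrete, here is a toy PyTorch sketch (purely illustrative, not code from the repository) showing that the two directions yield different gradients for the moving distribution:

```python
import torch

# Toy sketch of the two KL directions (illustration only, not the repository's code).
# q is the "moving" distribution (it carries gradients), p is held fixed.
logits = torch.randn(5, requires_grad=True)
q = torch.softmax(logits, dim=0)                 # moving distribution
p = torch.softmax(torch.randn(5), dim=0)         # fixed distribution (no gradient)

forward_kl = torch.sum(p * (p.log() - q.log()))  # KL(p || q), "mass-covering"
reverse_kl = torch.sum(q * (q.log() - p.log()))  # KL(q || p), "mode-seeking"

forward_kl.backward(retain_graph=True)
grad_forward = logits.grad.clone()
logits.grad.zero_()
reverse_kl.backward()
grad_reverse = logits.grad.clone()

# Both losses are valid, but they push the parameters in different directions.
print(grad_forward, grad_reverse)
```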
Now, with respect to the paper, \pi is the fixed distribution and \hat{p} is the moving one. It's been a while now, but if I remember correctly, the choice of argument order had little effect on the results, and I don't exactly remember why we chose one over the other. Do not hesitate to share any thoughts or insights on that matter :)
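For that specific setting, a hypothetical sketch (again, not the paper's actual code; the tensor shapes and prior values are made up) would look like this:

```python
import torch

# Hypothetical sketch of the setting above (names and values are mine, not the paper's code):
# pi is a fixed prior on the background/foreground proportion, p_hat is the proportion
# predicted by the model, so only p_hat moves during optimization.
pi = torch.tensor([0.7, 0.3])                             # fixed B/F proportion prior
logits = torch.randn(100, 2, requires_grad=True)          # stand-in for per-pixel model outputs
p_hat = torch.softmax(logits, dim=1).mean(dim=0)          # predicted B/F proportion

kl_phat_pi = torch.sum(p_hat * (p_hat.log() - pi.log()))  # KL(p_hat || pi)
kl_pi_phat = torch.sum(pi * (pi.log() - p_hat.log()))     # KL(pi || p_hat)

# pi carries no parameters, so in either order the gradient flows only through p_hat;
# the two orders just weight the mismatch between p_hat and pi differently.
```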
Malik
Hi @mboudiaf, thanks a lot for your clarification (and the links)! All is clear now.
Hi, thanks for your amazing work.
I have a (maybe naive) question about the order of the input arguments in the KL term of your loss. I thought that when you have a fixed distribution (such as the ground truth p) and a moving distribution (such as the prediction q), the way to write the KL term was KL(p || q) = sum p log(p/q) = -sum p log(q) + sum p log(p), so that it decomposes into a cross-entropy term (-sum p log q) and a term sum p log p that becomes a constant since p is known/fixed. Nevertheless, it seems that, as you present it, \pi is the moving distribution and the B/F proportion predicted by the model is the fixed one (p log(p/\pi)). Could you please clarify this for me?
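To be explicit about the identity I have in mind, here is a small numerical check (PyTorch, just for illustration):

```python
import torch

# Numerical check of the decomposition mentioned above (illustration only):
# KL(p || q) = -sum(p * log q) + sum(p * log p), i.e. a cross-entropy term plus a
# term that is constant once p is fixed.
p = torch.tensor([0.6, 0.4])   # fixed ground-truth distribution
q = torch.tensor([0.3, 0.7])   # moving prediction

kl = torch.sum(p * (p.log() - q.log()))
cross_entropy = -torch.sum(p * q.log())
const_term = torch.sum(p * p.log())

assert torch.allclose(kl, cross_entropy + const_term)
```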