namisan / mt-dnn

Multi-Task Deep Neural Networks for Natural Language Understanding

SMART Implementation: SymKLCriterion vs stable_kl #222

Closed. boubakerwa closed this issue 3 years ago.

boubakerwa commented 3 years ago

Hello,

I am wondering why there are two different implementations of $\ell_s$, the symmetric KL divergence, in the code (`SmartPerturbation`):

first, inside the loop: `adv_loss = stable_kl(adv_logits, logits.detach(), reduce=False)`, and then outside the loop: `adv_loss = adv_lc(logits, adv_logits, ignore_index=-1)`, where `adv_lc` is the `SymKlCriterion`, which is implemented differently from `stable_kl`.
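
For reference, the plain symmetric KL can be written as below. This is my own minimal sketch, not the repo's exact code; as far as I can tell, `stable_kl` computes the same quantity in a numerically stabilized way:

```python
import torch.nn.functional as F

def symmetric_kl(logits, adv_logits):
    # KL(p || q) + KL(q || p); each direction detaches its target so that
    # gradients flow through only one of the two logit tensors at a time.
    loss = F.kl_div(F.log_softmax(logits, dim=-1),
                    F.softmax(adv_logits.detach(), dim=-1),
                    reduction="batchmean")
    loss = loss + F.kl_div(F.log_softmax(adv_logits, dim=-1),
                           F.softmax(logits.detach(), dim=-1),
                           reduction="batchmean")
    return loss
```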

Am I missing something? I really appreciate your help!

namisan commented 3 years ago

It doesn't matter much which criterion is used in the inner loop, because the inner loop only estimates the adversarial samples: it just supplies a gradient direction for the perturbation. However, it does matter which criterion is used for the outer step, since that is the term that actually regularizes the model.
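
To make that division of labor concrete, here is a minimal sketch of the SMART loop under simplifying assumptions. `model.embed` and `model.forward_from_embed` are hypothetical hooks for reaching the embedding layer, not mt-dnn's actual API, and the hyperparameter names are illustrative; the real logic lives in `SmartPerturbation`:

```python
import torch
import torch.nn.functional as F

def kl(p_logits, q_logits):
    # KL(q || p) with q detached as the fixed target distribution.
    return F.kl_div(F.log_softmax(p_logits, dim=-1),
                    F.softmax(q_logits.detach(), dim=-1),
                    reduction="batchmean")

def smart_loss(model, inputs, logits, step_size=1e-3, noise_var=1e-5, k=1):
    # Hypothetical hooks: embed the inputs once, then re-run the rest of the
    # network from a perturbed embedding.
    embed = model.embed(inputs)
    noise = torch.randn_like(embed) * noise_var
    for _ in range(k):
        noise.requires_grad_()
        adv_logits = model.forward_from_embed(embed + noise)
        # Inner loop: this divergence only provides a gradient direction for
        # estimating the adversarial perturbation, so the exact choice
        # (stable_kl vs. SymKlCriterion) makes little difference.
        inner_loss = kl(adv_logits, logits)
        grad, = torch.autograd.grad(inner_loss, noise)
        noise = (noise + step_size * F.normalize(grad, dim=-1)).detach()
    adv_logits = model.forward_from_embed(embed + noise)
    # Outer step: this term is added to the training objective, so the exact
    # divergence (SymKlCriterion in mt-dnn) directly shapes the regularizer.
    return kl(adv_logits, logits) + kl(logits, adv_logits)
```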