Closed LorrinWWW closed 2 years ago
These two shouldn't be the same thing, right? My understanding is that the softmax in the first `get_new_layer_weight` corresponds to Eq. 23 in the original paper, i.e., it updates each layer weight omega, while the second one is simply normalization for convenience of computation.
The experiment did use two softmaxes, in order to make the weights smoother. Later, we found that summing the weights and normalizing can also achieve ideal results.
Thank you for your clarification!
Thanks for sharing the code! In emd_task_distill.py, you seem to perform softmax on the layer weights twice by default. Is this intended, or am I misunderstanding something?
First softmax: https://github.com/lxk00/BERT-EMD/blob/2e1062bf9c912e6d335bcc994d372e962fe262df/bert-emd/emd_task_distill.py#L384-L385
Second softmax: https://github.com/lxk00/BERT-EMD/blob/2e1062bf9c912e6d335bcc994d372e962fe262df/bert-emd/emd_task_distill.py#L424-L429
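For anyone following along, here is a minimal sketch (not the repo's actual code; the weight values are made up) of why a second softmax makes the layer-weight distribution "smoother", and what the sum-normalization alternative mentioned above looks like:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    e = np.exp(x - np.max(x))
    return e / e.sum()

# hypothetical layer-weight logits, for illustration only
w = np.array([2.0, 1.0, 0.1])

once = softmax(w)        # single softmax over the layer weights
twice = softmax(once)    # second softmax, as in the default code path
sum_norm = w / w.sum()   # the addition-and-normalize alternative

# the second softmax compresses the gap between the largest
# and smallest weight, i.e. it smooths the distribution
assert twice.max() - twice.min() < once.max() - once.min()
```

Since `once` already sums to 1, its entries lie in a narrow range, so applying softmax again pushes the weights toward uniform; that matches the author's explanation that the double softmax was used to smooth the weights.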