huawei-noah / Pretrained-Language-Model

Pretrained language model and its related optimization techniques developed by Huawei Noah's Ark Lab.

Purpose of fit_size when the teacher and student hidden_size differ #44

Closed littttttlebird closed 3 years ago

littttttlebird commented 4 years ago

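Suppose the teacher and student have hidden_size d and d', respectively. When d ≠ d', the student model's fit_dense layer maps d' to the same dimension as d, so that the hidden_state loss between student and teacher can be computed. But when d and d' are equal, the hidden_state loss could be computed directly without going through the fit_dense mapping, right? The code, however, uses an `if is_student` check; shouldn't it actually check whether d equals d'?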
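
For reference, the hidden-state loss being discussed can be written as below, following the notation of the TinyBERT paper. Here $l$ (sequence length) and $W_h$ (the weight that fit_dense is assumed to implement) are symbols taken from the paper, not from this thread:

$$
\mathcal{L}_{\text{hidn}} = \operatorname{MSE}\left(H^{S} W_h,\; H^{T}\right), \qquad H^{S} \in \mathbb{R}^{l \times d'},\quad H^{T} \in \mathbb{R}^{l \times d},\quad W_h \in \mathbb{R}^{d' \times d}
$$

When $d' = d$, $W_h$ is square, so the question is whether it can be dropped (in effect fixed to the identity) in that case.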

nlpBeginner commented 4 years ago

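The fit_dense layer maps TinyBERT's dimension from d' to d. Of course, checking d' == d would achieve that as well. The is_student check is used here so that the student model always applies one linear transformation, regardless of whether d' equals d.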
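
A minimal NumPy sketch of the behaviour described above, not the repository's actual code: one shared projection is applied to every student layer before the MSE against the teacher, whether or not d' equals d. The helper names `make_fit_dense` and `hidden_state_loss`, the initialization scale, and the sizes 312 and 768 are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_fit_dense(student_dim, fit_size):
    """Hypothetical stand-in for fit_dense: a weight of shape (d', d) and a bias."""
    weight = rng.normal(scale=0.02, size=(student_dim, fit_size))
    bias = np.zeros(fit_size)
    return weight, bias


def hidden_state_loss(student_layers, teacher_layers, fit_dense):
    """Sum of per-layer MSE between projected student states and teacher states."""
    weight, bias = fit_dense
    total = 0.0
    for h_s, h_t in zip(student_layers, teacher_layers):
        # The same projection is used for every layer: (seq_len, d') -> (seq_len, d).
        projected = h_s @ weight + bias
        total += float(np.mean((projected - h_t) ** 2))
    return total


seq_len, num_layers = 8, 4
# Illustrative sizes only: one case with d' != d and one with d' == d.
for d_student, d_teacher in [(312, 768), (768, 768)]:
    fit_dense = make_fit_dense(d_student, d_teacher)
    student = [rng.normal(size=(seq_len, d_student)) for _ in range(num_layers)]
    teacher = [rng.normal(size=(seq_len, d_teacher)) for _ in range(num_layers)]
    # The projection is applied in both cases, mirroring the is_student check.
    print(d_student, d_teacher, hidden_state_loss(student, teacher, fit_dense))
```

Replacing the unconditional projection with a `d_student != d_teacher` check would skip the matrix multiply in the second case, which is the alternative proposed in the original question.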

littttttlebird commented 4 years ago

Why apply a transformation even when d = d'?

shairoz-deci commented 3 years ago

Following up on @chuanhuayang's issue, a few questions about the fit_dense matrix.

  1. Why make it learnable rather than leaving it fixed at a random initialization, given that it serves merely as a linear projection and is not used at inference?
  2. Why use a single matrix rather than one matrix per stage? Thanks