Open sh0416 opened 2 years ago
Hi,
In the paper, a linear transformation is applied to match the hidden representations of the student to those of the teacher.
In the code, this is implemented using `fit_dense`, but this layer is instantiated only once.
Does this mean the linear transformation weights are shared across all layers? Am I understanding this correctly?
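
To make the question concrete, here is a minimal PyTorch sketch of the two options I have in mind. It is not the repository's code: the dimensions, the layer count, and the names `shared_proj` and `per_layer_proj` are hypothetical, chosen only for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
student_dim, teacher_dim, num_layers = 4, 8, 3

# Option A: one projection, instantiated once and reused for every layer.
# All layers share the same weight matrix and bias.
shared_proj = nn.Linear(student_dim, teacher_dim)

# Option B: one projection per layer, each with its own weights.
per_layer_proj = nn.ModuleList(
    [nn.Linear(student_dim, teacher_dim) for _ in range(num_layers)]
)

# Dummy student hidden states, one tensor per layer: (batch, seq_len, student_dim).
student_hidden = [torch.randn(2, 5, student_dim) for _ in range(num_layers)]

# Project every layer's hidden states up to the teacher dimension.
shared_out = [shared_proj(h) for h in student_hidden]
separate_out = [proj(h) for proj, h in zip(per_layer_proj, student_hidden)]


def count_params(module):
    """Total number of trainable scalars in a module."""
    return sum(p.numel() for p in module.parameters())


shared_params = count_params(shared_proj)        # weight (8x4) plus bias (8)
per_layer_params = count_params(per_layer_proj)  # num_layers independent copies
```

Both options produce outputs of the same shape; the difference is only whether the projection parameters are tied across layers (Option A) or learned separately for each layer (Option B).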