seanie12 opened this issue 3 years ago
Hi, thanks for using the repo! We actually use the sum of the meta net's output hidden states and the meta net's input as the input to the next layer. So instead of learning the transformation directly, we learn the difference between the original hidden states and the transformed hidden states. The reason we do this is that we want to push the transformation to happen only through the meta net.
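A minimal sketch of this residual connection, under my own assumptions (the `MetaNet` class, its MLP internals, and the hidden size are placeholders, not the repo's actual implementation):

```python
import torch
import torch.nn as nn

class MetaNet(nn.Module):
    """Hypothetical meta net: a small MLP that predicts a residual update."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, hidden):
        # The meta net only learns the *difference* from the original
        # hidden states; the skip connection below adds it back.
        return self.mlp(hidden)

meta_net = MetaNet(hidden_size=768)
hidden = torch.randn(2, 16, 768)          # hidden states from some encoder layer
transformed = hidden + meta_net(hidden)   # input + meta-net output, fed to the next layer
```

With this formulation, if the meta net outputs zeros the hidden states pass through unchanged, so any change to the representation has to come from the meta net itself.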
Hi, thank you for sharing all the code.
I have a question about the bilevel optimization procedure.
Why do we need the detach() operation here?
As far as I understand equation (3), we need to update all of the parameters \theta. But if we detach the hidden states, we treat them as constants, which means we cannot update the lower layers of BERT or RoBERTa.
Thank you.
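For context on the detach() question, here is a minimal sketch (not from the repo; the layer names are stand-ins) showing why detaching a tensor blocks gradients to the layers that produced it:

```python
import torch
import torch.nn as nn

lower_layer = nn.Linear(4, 4)   # stands in for the lower BERT/RoBERTa layers
meta_net = nn.Linear(4, 4)      # stands in for the meta net

x = torch.randn(2, 4)
hidden = lower_layer(x)

# Case 1: no detach -- gradients flow back through lower_layer.
loss = meta_net(hidden).sum()
loss.backward()
print(lower_layer.weight.grad is None)   # False: the lower layer receives gradients

# Case 2: detach -- hidden is treated as a constant in the graph,
# so this loss produces no gradient for the lower layer.
lower_layer.zero_grad()
loss = meta_net(hidden.detach()).sum()
loss.backward()
print(lower_layer.weight.grad)           # None (or zeros on older PyTorch): no update below
```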