seanie12 opened this issue 3 years ago
Hi, thanks for using the repo! We actually use the sum of the meta net's output hidden states and the meta net's input as the input to the next layer. So instead of learning the transformation directly, we learn the difference between the original hidden states and the transformed hidden states. The reason we do this is that we want to push the transformation to happen only through the meta net.
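A minimal sketch of this residual connection, under my own assumptions (the `MetaNet` class, its MLP internals, and the hidden size are placeholders, not the repo's actual implementation):

```python
import torch
import torch.nn as nn

class MetaNet(nn.Module):
    """Hypothetical meta net: a small MLP that predicts a residual update."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, hidden):
        # The meta net only learns the *difference* from the original
        # hidden states; the skip connection below adds it back.
        return self.mlp(hidden)

meta_net = MetaNet(hidden_size=768)
hidden = torch.randn(2, 16, 768)          # hidden states from some encoder layer
transformed = hidden + meta_net(hidden)   # input + meta-net output, fed to the next layer
```

With this formulation, if the meta net outputs zeros the hidden states pass through unchanged, so any change to the representation has to come from the meta net itself.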
Hi, thank you for sharing all the code.
I have a question about the bilevel optimization procedure.
Why do we need the detach() operation here?
As far as I understand equation (3), we need to update all of the parameters \theta. But if we detach the hidden states, we treat them as constants, which means we cannot update the lower layers of BERT or RoBERTa.
Thank you.
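For context on the detach() question, here is a minimal sketch (not from the repo; the layer names are stand-ins) showing why detaching a tensor blocks gradients to the layers that produced it:

```python
import torch
import torch.nn as nn

lower_layer = nn.Linear(4, 4)   # stands in for the lower BERT/RoBERTa layers
meta_net = nn.Linear(4, 4)      # stands in for the meta net

x = torch.randn(2, 4)
hidden = lower_layer(x)

# Case 1: no detach -- gradients flow back through lower_layer.
loss = meta_net(hidden).sum()
loss.backward()
print(lower_layer.weight.grad is None)   # False: the lower layer receives gradients

# Case 2: detach -- hidden is treated as a constant in the graph,
# so this loss produces no gradient for the lower layer.
lower_layer.zero_grad()
loss = meta_net(hidden.detach()).sum()
loss.backward()
print(lower_layer.weight.grad)           # None (or zeros on older PyTorch): no update below
```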