microsoft / DeBERTa

The implementation of DeBERTa
MIT License

When calculating Qr, why is the W of content used instead of the W of position? #136

Open nebula303 opened 1 year ago

nebula303 commented 1 year ago

In disentangled_attention.py, the position query is computed as `pos_query_layer = self.transpose_for_scores(self.query_proj(rel_embeddings), self.num_attention_heads).repeat(query_layer.size(0)//self.num_attention_heads, 1, 1)`. Why is `self.query_proj` applied to `rel_embeddings` here? Shouldn't `self.query_proj` be used only to compute `query_layer`, with a separate projection for the relative position embeddings?
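To make the question concrete, here is a minimal sketch (with hypothetical tensor shapes and variable names, not the actual DeBERTa module) contrasting the two designs being asked about: the paper's notation suggests separate projections W_{q,c} for content and W_{q,r} for relative positions, while the quoted code reuses the same `query_proj` weight for both, which shares parameters between the content and position queries:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical small dimensions for illustration only
hidden_size, seq_len, num_pos = 8, 4, 6
hidden_states = torch.randn(seq_len, hidden_size)   # H: token (content) states
rel_embeddings = torch.randn(num_pos, hidden_size)  # R: relative-position embedding table

query_proj = nn.Linear(hidden_size, hidden_size)      # shared W_q, as in the quoted code
pos_query_proj = nn.Linear(hidden_size, hidden_size)  # separate W_{q,r}, as in the paper's notation

# Content queries: Q_c = H W_q (both variants agree on this part)
content_query = query_proj(hidden_states)

# Variant 1 (the code in question): position queries reuse query_proj,
# so Q_r = R W_q -- content and position share the same projection weights.
pos_query_shared = query_proj(rel_embeddings)

# Variant 2 (paper notation): position queries get their own projection,
# so Q_r = R W_{q,r} with independent weights.
pos_query_separate = pos_query_proj(rel_embeddings)

# Both produce position queries of the same shape; only the parameterization differs.
assert pos_query_shared.shape == pos_query_separate.shape == (num_pos, hidden_size)
```

Under the shared variant the model has fewer parameters, and the relative-position embeddings are projected into the same query space as the content states before the content-to-position attention scores are computed.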