microsoft / UniSpeech

UniSpeech - Large Scale Self-Supervised Learning for Speech

The code for computing the relative positional embedding #25

Closed mycrazycracy closed 2 years ago

mycrazycracy commented 2 years ago

Hi, I have been using this project for months; thank you for the excellent work!

Recently I dove into the code and found a few things that seem wrong in the multi-head attention:

  1. It seems that the code for computing the positional embedding here is wrong: https://github.com/microsoft/UniSpeech/blob/main/WavLM/modules.py#L730. It updates the positional embedding with the GRU gate and passes the result on to the next layer. However, I think only the original embedding should be passed to the next layer, not the gated one (see the sketch after these questions).

  2. In the other branch, the positional embedding is correct, but the input query is used to compute the attention mask. According to the paper, the projected query should be used rather than the original one. Is there anything I missed?

  3. The way the relative positional embedding is gated with the GRU seems different from the formula in the paper.

Could anyone please help clarify these points?
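
To make point 1 concrete, here is a rough, single-head sketch of the behaviour I would expect: the GRU-style gate (computed from the query) scales the relative position bias only for the current layer's attention scores, while the ungated bias is what gets forwarded to the next layer. The names `grep_linear` and `grep_a` loosely mirror the module code, but the shapes and the gating reduction are simplified, so this is only an illustration, not the repo's implementation.

```python
import torch
import torch.nn as nn


class GatedRelPosBias(nn.Module):
    """Sketch: GRU-style gate over a relative position bias (single head).

    The gated bias is consumed only by the current layer's attention,
    while the original, ungated `position_bias` is returned for the
    next layer.
    """

    def __init__(self, head_dim: int):
        super().__init__()
        # Projects each query vector to two scalar gate pre-activations.
        self.grep_linear = nn.Linear(head_dim, 2)
        self.grep_a = nn.Parameter(torch.ones(1))

    def forward(self, query: torch.Tensor, position_bias: torch.Tensor):
        # query: (seq_len, head_dim); position_bias: (seq_len, seq_len)
        gate_a, gate_b = torch.sigmoid(self.grep_linear(query)).chunk(2, dim=-1)
        gate = gate_a * (gate_b * self.grep_a - 1.0) + 2.0   # (seq_len, 1)
        gated_bias = gate * position_bias                    # used in this layer only
        return gated_bias, position_bias                     # forward the ungated bias


if __name__ == "__main__":
    seq_len, head_dim = 5, 16
    layer = GatedRelPosBias(head_dim)
    q = torch.randn(seq_len, head_dim)
    bias = torch.randn(seq_len, seq_len)
    gated, passed_on = layer(q, bias)
    print(gated.shape, passed_on.shape)  # torch.Size([5, 5]) torch.Size([5, 5])
```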

Sanyuan-Chen commented 2 years ago

Hi @mycrazycracy ,

Thanks for your interest and detailed questions! Yes, you are right. There were some mistakes in the positional-embedding code when we were pre-training the WavLM models. To keep pre-training and inference consistent, we released the exact implementation we used for pre-training.