microsoft / UniSpeech

UniSpeech - Large Scale Self-Supervised Learning for Speech

The code for computing the relative positional embedding #25

Closed mycrazycracy closed 2 years ago

mycrazycracy commented 2 years ago

Hi, I have been using this project for months; thank you for the excellent work!

Recently I dove into the code and found a few things that seem wrong in the multi-head attention:

  1. It seems that the code for computing the positional embedding here is wrong: https://github.com/microsoft/UniSpeech/blob/main/WavLM/modules.py#L730. It updates the positional embedding with the GRU gate and passes the result on to the next layer. However, I think only the original embedding should be passed to the next layer, not the gated one (see the sketch after these questions).

  2. In the other branch, the positional embedding is correct, but the input query is used to compute the attention mask. According to the paper, the projected query should be used rather than the original one. Is there anything I missed?

  3. The way the relative positional embedding is gated with the GRU seems different from the formula in the paper.

Could anyone please help clarify these points?
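
To make point 1 concrete, here is a rough, single-head sketch of the behaviour I would expect: the GRU-style gate (computed from the query) scales the relative position bias only for the current layer's attention scores, while the ungated bias is what gets forwarded to the next layer. The names `grep_linear` and `grep_a` loosely mirror the module code, but the shapes and the gating reduction are simplified, so this is only an illustration, not the repo's implementation.

```python
import torch
import torch.nn as nn


class GatedRelPosBias(nn.Module):
    """Sketch: GRU-style gate over a relative position bias (single head).

    The gated bias is consumed only by the current layer's attention,
    while the original, ungated `position_bias` is returned for the
    next layer.
    """

    def __init__(self, head_dim: int):
        super().__init__()
        # Projects each query vector to two scalar gate pre-activations.
        self.grep_linear = nn.Linear(head_dim, 2)
        self.grep_a = nn.Parameter(torch.ones(1))

    def forward(self, query: torch.Tensor, position_bias: torch.Tensor):
        # query: (seq_len, head_dim); position_bias: (seq_len, seq_len)
        gate_a, gate_b = torch.sigmoid(self.grep_linear(query)).chunk(2, dim=-1)
        gate = gate_a * (gate_b * self.grep_a - 1.0) + 2.0   # (seq_len, 1)
        gated_bias = gate * position_bias                    # used in this layer only
        return gated_bias, position_bias                     # forward the ungated bias


if __name__ == "__main__":
    seq_len, head_dim = 5, 16
    layer = GatedRelPosBias(head_dim)
    q = torch.randn(seq_len, head_dim)
    bias = torch.randn(seq_len, seq_len)
    gated, passed_on = layer(q, bias)
    print(gated.shape, passed_on.shape)  # torch.Size([5, 5]) torch.Size([5, 5])
```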

Sanyuan-Chen commented 2 years ago

Hi @mycrazycracy ,

Thanks for your interest and detailed questions! Yes, you are right. There were some mistakes in the positional-embedding code when we were pre-training the WavLM models. To keep pre-training and inference consistent, we released the exact implementation we used for pre-training.