Hi,
The SimpleKT paper states that the model uses ordinary dot-product attention, but in the code in this repository I found that the implementation uses multi-head attention. Do I understand correctly that what is actually used here is dot-product attention (i.e., an attention module without trainable weights) run several times in parallel, once per head? Thank you for your help.
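To make sure I'm describing the distinction I have in mind, here is a minimal sketch (not taken from this repository; all names and shapes are illustrative) of plain scaled dot-product attention, which has no trainable parameters, versus the usual multi-head variant that adds learned Q/K/V/output projections and then applies the parameter-free attention in each head:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # Plain dot-product attention: no trainable parameters,
    # just softmax(Q K^T / sqrt(d)) V.
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.matmul(F.softmax(scores, dim=-1), v)

class MultiHeadDotProductAttention(torch.nn.Module):
    # Typical multi-head attention: learned linear projections split the
    # model dimension into n_heads subspaces, and the parameter-free
    # dot-product attention above runs in each subspace in parallel.
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.w_q = torch.nn.Linear(d_model, d_model)
        self.w_k = torch.nn.Linear(d_model, d_model)
        self.w_v = torch.nn.Linear(d_model, d_model)
        self.w_o = torch.nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        b = q.size(0)
        # Project, then reshape so each head attends over its own subspace.
        q = self.w_q(q).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        k = self.w_k(k).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_v(v).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        out = scaled_dot_product_attention(q, k, v, mask)
        out = out.transpose(1, 2).contiguous().view(b, -1, self.n_heads * self.d_head)
        return self.w_o(out)
```

So my question is whether SimpleKT's heads work like the second class above (with learned projections per head), or whether the heads really are just the parameter-free dot-product attention repeated in parallel.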