The computation details of Multi-Head Attention can be found in the "Attention Is All You Need" paper: https://arxiv.org/abs/1706.03762
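For reference, a minimal numpy sketch of that computation (scaled dot-product attention applied per head, then the heads concatenated). The shapes are illustrative assumptions, and the learned projections W^Q, W^K, W^V, W^O that the paper applies around this core are omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, num_heads):
    # q, k, v: (batch, seq_len, hidden); hidden must be divisible by num_heads
    batch, seq_len, hidden = q.shape
    d_head = hidden // num_heads

    def split_heads(x):
        # (batch, seq_len, hidden) -> (batch, num_heads, seq_len, d_head)
        return x.reshape(batch, -1, num_heads, d_head).transpose(0, 2, 1, 3)

    qh, kh, vh = split_heads(q), split_heads(k), split_heads(v)

    # scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    scores = qh @ kh.transpose(0, 1, 3, 2) / np.sqrt(d_head)
    out = softmax(scores) @ vh

    # concatenate the heads back into (batch, seq_len, hidden)
    return out.transpose(0, 2, 1, 3).reshape(batch, seq_len, hidden)
```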
An example of how to implement the behavior of the MultiHeadAttention operator is given in this ONNX Runtime issue: https://github.com/microsoft/onnxruntime/issues/19924
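As a rough cross-check against that behavior, the sketch above can be exercised for both self- and cross-attention. The shapes below are arbitrary assumptions, and the real contrib op additionally supports masks, bias, and packed inputs, which are not covered here:

```python
rng = np.random.default_rng(seed=0)
hidden, num_heads = 64, 4

# self-attention: query, key and value all come from the same tensor
x = rng.standard_normal((2, 8, hidden)).astype(np.float32)
print(multi_head_attention(x, x, x, num_heads).shape)  # (2, 8, 64)

# cross-attention: the key/value sequence length may differ from the query's
kv = rng.standard_normal((2, 12, hidden)).astype(np.float32)
print(multi_head_attention(x, kv, kv, num_heads).shape)  # (2, 8, 64)
```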
A useful article about Transformers and Attention: https://towardsdatascience.com/transformers-explained-visually-part-1-overview-of-functionality-95a6dd460452

Related MIGraphX pull request: https://github.com/ROCm/AMDMIGraphX/pull/3425