Describe the bug
It seems that there is a typo in the Multi-Head Attention markdown cell:
We refer to this as Multi-Head Attention layer with the learnable parameters $W_{1...h}^{Q}\in\mathbb{R}^{D\times d_k}$, $W_{1...h}^{K}\in\mathbb{R}^{D\times d_k}$, $W_{1...h}^{V}\in\mathbb{R}^{D\times d_v}$, and $W^{O}\in\mathbb{R}^{h\cdot d_k\times d_{out}}$ ($D$ being the input dimensionality). Expressed in a computational graph, we can visualize it as below (figure credit - Vaswani et al., 2017).
Here, instead of $W^{O}\in\mathbb{R}^{h\cdot d_k\times d_{out}}$, it probably should say $W^{O}\in\mathbb{R}^{h\cdot d_v\times d_{out}}$,
since the output of the heads is a concatenation of $h$ value vectors, each of dimension $d_v$, i.e. $h\cdot d_v$ features in total (see the shape check below).
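A minimal shape check illustrating the dimension argument (the sizes below are made-up example values, not taken from the tutorial; $d_v\neq d_k$ is chosen on purpose so the mismatch is visible):

```python
import torch

# Hypothetical example sizes (not from the tutorial); d_v != d_k on purpose.
h, d_k, d_v, d_out, T = 8, 32, 48, 256, 10

# Each head produces a (T, d_v) output; concatenating h heads gives (T, h*d_v).
head_outputs = [torch.randn(T, d_v) for _ in range(h)]
concat = torch.cat(head_outputs, dim=-1)   # shape (T, h * d_v)

# The output projection therefore has to map h*d_v -> d_out, not h*d_k -> d_out.
W_O = torch.randn(h * d_v, d_out)
out = concat @ W_O                          # shape (T, d_out)
print(concat.shape, out.shape)
```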
Screenshots
A screenshot from the original paper:
Tutorial: 6