probml / pml2-book

Probabilistic Machine Learning: Advanced Topics
MIT License

16.2.7 Attention layers, definition of key, 2023-01-19 draft #228

Closed: dogandzic closed this issue 1 year ago

dogandzic commented 1 year ago

I am definitely not an expert on this topic, but it seems to me that the stored keys $K = W^K X$ should actually be $K = X (W^K)^\top$, where $^\top$ denotes transposition. The same applies to the other matrices, such as $Q$ and $V$.

Also, the dimension of $\boldsymbol x$ should not be $d_k$ but $d$; the dimension $d_k$ should instead refer to the keys $K$.
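A quick shape check illustrates the point. This is a minimal sketch, assuming the row-vector convention where each row of $X$ is a token embedding in $\mathbb{R}^d$; the sizes `n`, `d`, `d_k` are illustrative, not from the book:

```python
import numpy as np

n, d, d_k = 5, 8, 4                  # illustrative sizes (not from the book)
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))      # n token embeddings, each of dimension d
W_K = rng.standard_normal((d_k, d))  # key projection matrix

# Row-vector convention: K = X (W^K)^T maps each token to a d_k-dim key.
K = X @ W_K.T
assert K.shape == (n, d_k)

# By contrast, W^K X only type-checks if the tokens are the *columns* of X,
# i.e. X has shape (d, n); then the keys come out as columns of a (d_k, n) matrix.
K_col = W_K @ X.T
assert K_col.shape == (d_k, n)
```

So both forms are consistent under their respective conventions; the issue is that the text defines $\boldsymbol x$ as a row vector of dimension $d$, which forces the $K = X (W^K)^\top$ form.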

murphyk commented 1 year ago

I think you are right. I have made quite a few changes to this section, PTAL at the new version (to be posted soon).