xiongtx opened this issue 3 months ago
Maintainer reply: Thanks for the comment, and I 100% agree. Not sure why I made it unnecessarily complicated there. In my other book (Build an LLM from Scratch), I use the more legible version, similar to what you suggest: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch03/01_main-chapter-code/ch03.ipynb
I find the matrix operations in Chapter 16 confusing. For example, instead of:
It's clearer to do:
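(The original snippets from the chapter aren't reproduced in this thread. As a hypothetical sketch of the kind of contrast being described, assuming standard scaled dot-product self-attention: explicit `torch.matmul` calls with transposes versus the `@` operator, which reads like the math.)

```python
import torch

torch.manual_seed(0)
d = 8                      # embedding dimension
x = torch.randn(5, d)      # 5 tokens, each a d-dimensional embedding

W_q = torch.randn(d, d)    # query, key, value projection matrices
W_k = torch.randn(d, d)
W_v = torch.randn(d, d)

# More convoluted style: nested torch.matmul calls with explicit transposes
queries = torch.matmul(x, W_q)
keys = torch.matmul(x, W_k)
values = torch.matmul(x, W_v)
scores = torch.matmul(queries, keys.transpose(0, 1))

# Clearer style: the @ operator mirrors the formula Q K^T
queries2 = x @ W_q
keys2 = x @ W_k
scores2 = queries2 @ keys2.T

# Both produce the same attention scores; finish with the usual
# scaled softmax-weighted sum over the values
attn = torch.softmax(scores2 / d**0.5, dim=-1) @ values
```

Both styles compute identical results; the second just makes each step visually match the equation.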
With multi-head attention, instead of:
we can just do:
which makes it clear that the multi-head case is analogous to the single-head case.
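(Again, the chapter's actual multi-head snippets aren't quoted in the thread. A hypothetical sketch of the "analogous to single-head" point: split the projections into a leading head dimension, and the batched `@` matmuls take exactly the same form as the single-head case.)

```python
import torch

torch.manual_seed(0)
n, d, h = 5, 8, 2          # tokens, embedding dim, number of heads
d_h = d // h               # per-head dimension
x = torch.randn(n, d)

W_q = torch.randn(d, d)
W_k = torch.randn(d, d)
W_v = torch.randn(d, d)

# Project once, then split the last dimension into heads:
# (n, d) -> (n, h, d_h) -> (h, n, d_h)
queries = (x @ W_q).view(n, h, d_h).transpose(0, 1)
keys = (x @ W_k).view(n, h, d_h).transpose(0, 1)
values = (x @ W_v).view(n, h, d_h).transpose(0, 1)

# Batched matmul: same expression as the single-head case,
# just with a leading head dimension broadcast through
scores = queries @ keys.transpose(-2, -1)                  # (h, n, n)
attn = torch.softmax(scores / d_h**0.5, dim=-1) @ values   # (h, n, d_h)

# Merge the heads back: (h, n, d_h) -> (n, d)
out = attn.transpose(0, 1).reshape(n, d)
```

Because `@` broadcasts over the leading head dimension, `queries @ keys.transpose(-2, -1)` is literally the single-head expression applied per head, which is the analogy the issue is pointing at.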