Closed mokeyish closed 2 years ago
Hi mokeyish,
You are totally right, thanks for spotting this! I've corrected this here.
This is actually a great illustration of the point I am trying to make in my blog post. The change above effectively does nothing; the model works the same before and after it. The reason is that it doesn't matter whether we call a chunk of the resulting projection matrix v or k, as long as we do so consistently.
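To make the point concrete, here is a minimal NumPy sketch (names and shapes are illustrative, not taken from the repo): a single combined projection produces three chunks, and which chunk we label k versus v is just a naming choice over slices of the same learnable matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 8
# One combined projection matrix whose output is split into three chunks
# (analogous to a single qkv linear layer in multi-head attention).
w_qkv = rng.standard_normal((hidden, 3 * hidden))
x = rng.standard_normal((4, hidden))  # 4 tokens with `hidden` features each
q, chunk_a, chunk_b = np.split(x @ w_qkv, 3, axis=-1)
# Calling `chunk_a` the keys and `chunk_b` the values (or the other way
# around) only relabels which slice of the same weight matrix plays which
# role; before training the model is identical either way, and training
# would simply learn the appropriate parameters under whichever labeling.
```

Since every chunk is just a view into the same randomly initialized projection, swapping the k/v labels changes nothing about the model's expressiveness.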
Thanks again for spotting it!
I read the transformer code of PyTorch weeks ago, but found that PyTorch's implementation of MultiHeadAttention is not written in pure Python. Thanks for your code😁 letting me understand it. Looking forward to your next NLP post.
Thanks for your post, but I have a question about this:
https://github.com/jsbaan/transformer-from-scratch/blob/5f92dc570807dfe3e8033cd19bda1639672cb1c1/multi_head_attention.py#L97
https://github.com/jsbaan/transformer-from-scratch/blob/5f92dc570807dfe3e8033cd19bda1639672cb1c1/multi_head_attention.py#L59-L62