jsbaan / transformer-from-scratch

Well-documented, unit-tested, type-checked and formatted implementation of a vanilla transformer - for educational purposes.

Returned q, k, v but assigned q, v, k? Is there any difference? #1

Closed · mokeyish closed 2 years ago

mokeyish commented 2 years ago

Thanks for your post, but I have a question about this:

https://github.com/jsbaan/transformer-from-scratch/blob/5f92dc570807dfe3e8033cd19bda1639672cb1c1/multi_head_attention.py#L97

https://github.com/jsbaan/transformer-from-scratch/blob/5f92dc570807dfe3e8033cd19bda1639672cb1c1/multi_head_attention.py#L59-L62

jsbaan commented 2 years ago

Hi mokeyish,

You are totally right, thanks for spotting this! I've corrected this here.

This is actually a great illustration of the point I am trying to make in my blogpost. The change above effectively does nothing: the model behaves the same before and after it. The reason is that it doesn't matter whether we call a chunk of the resulting projection matrix v or k, as long as we do so consistently.
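
For concreteness, here is a minimal sketch of the pattern in question (the names like `qkv_proj` and the shapes are illustrative assumptions, not the repository's exact code): one linear layer produces the combined projection, which is then chunked and given the names q, k and v.

```python
import torch
import torch.nn as nn

# Illustrative shapes; not the repository's exact code.
batch_size, seq_len, hidden_dim = 2, 5, 8
x = torch.randn(batch_size, seq_len, hidden_dim)

# A single linear layer projects the input to 3 * hidden_dim; the output is
# then split into three equally sized chunks that we *name* q, k and v.
qkv_proj = nn.Linear(hidden_dim, 3 * hidden_dim)
qkv = qkv_proj(x)

# Variant A: chunks assigned in the order q, k, v.
q, k, v = qkv.chunk(3, dim=-1)

# Variant B: the "swapped" assignment from this issue, q, v, k.
q2, v2, k2 = qkv.chunk(3, dim=-1)

# Either naming is equivalent at initialization: each chunk comes from its own
# learned slice of the projection weights, so whichever slice we consistently
# treat as k simply learns to play the role of k during training.
```

Because each named chunk corresponds to its own learned slice of the projection weights, swapping which chunk is called k and which v only relabels those slices; training is unaffected as long as the naming is consistent throughout the module.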

Thanks again for spotting it!

mokeyish commented 2 years ago

I read PyTorch's transformer code a few weeks ago, but found that its MultiheadAttention implementation is not written in pure Python. Thanks for your code 😁, it helped me understand. Looking forward to your next NLP post.