@LWprogramming very observant! this is actually using an updated technique from this paper
the technique was employed by both PaLM and AlphaCode. in other words, it will scale just fine, and it saves a ton of memory when doing decoding
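For anyone following along, the description below matches multi-query attention, so here is a minimal sketch of the idea in PyTorch, assuming that reading: queries get one projection per head, while a single key/value head is shared across all heads, so the KV cache during decoding is `heads` times smaller. The class and argument names are just illustrative, not the repo's actual `Attention` implementation.

```python
import torch
from torch import nn, einsum

class MultiQueryAttention(nn.Module):
    def __init__(self, dim, heads=8, dim_head=64):
        super().__init__()
        self.heads = heads
        self.scale = dim_head ** -0.5
        inner_dim = heads * dim_head
        self.to_q = nn.Linear(dim, inner_dim, bias=False)      # one query projection per head
        self.to_kv = nn.Linear(dim, dim_head * 2, bias=False)  # a single shared key/value head
        self.to_out = nn.Linear(inner_dim, dim, bias=False)

    def forward(self, x):
        b, n, _ = x.shape
        h = self.heads
        q = self.to_q(x).view(b, n, h, -1).transpose(1, 2)     # (b, h, n, dim_head)
        k, v = self.to_kv(x).chunk(2, dim=-1)                  # (b, n, dim_head) each, no head dimension
        sim = einsum('b h i d, b j d -> b h i j', q, k) * self.scale
        attn = sim.softmax(dim=-1)
        out = einsum('b h i j, b j d -> b h i d', attn, v)
        out = out.transpose(1, 2).reshape(b, n, -1)            # merge heads back
        return self.to_out(out)

mqa = MultiQueryAttention(dim=512)
out = mqa(torch.randn(2, 100, 512))  # (2, 100, 512)
```

Only `k` and `v` need to be cached at inference time, which is where the memory savings come from.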
Oh man, another paper for my reading list 😂
I'll make a PR after I finish reading through the rest of the code, to add comments for all these little optimizations that weren't around in the original paper
@LWprogramming haha yeah, the field is a science, so there's a lot of literature
Hi @LWprogramming and @lucidrains,
Related to this topic, do you know why we omit the self-attention for the context in the transformer encoder before passing the context to the cross-attention?
In the implementation, I found that we directly pass the context to the cross-attention without doing self-attention as in the original paper: https://github.com/lucidrains/audiolm-pytorch/blob/1a888d2f462384baf5dc8b4782f39a40f59593b7/audiolm_pytorch/audiolm_pytorch.py#L503
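To make the question concrete, here is a rough sketch of the pattern I mean, with made-up layer names rather than the repo's actual classes: the raw context embeddings go straight into cross-attention, with no self-attention stack over the context first.

```python
import torch
from torch import nn

class DecoderLayer(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, context):
        h = self.norm1(x)
        # self-attention over the decoder's own tokens
        x = x + self.self_attn(h, h, h)[0]
        h = self.norm2(x)
        # cross-attention reads the context embeddings as-is,
        # i.e. the context itself was never self-attended beforehand
        x = x + self.cross_attn(h, context, context)[0]
        return x

layer = DecoderLayer(dim=512)
tokens = torch.randn(2, 100, 512)   # decoder tokens
context = torch.randn(2, 50, 512)   # e.g. conditioning embeddings, used directly
out = layer(tokens, context)        # (2, 100, 512)
```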
Currently in the `Attention` class: this seems to learn a separate embedding -> query mapping per head, but embedding -> key or value would be the same across heads, while the original attention paper says `k` and `v` should also be independent (section 3.2.2, bottom of page 4).
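To illustrate the shape difference I mean (purely illustrative numbers and names, not the actual code): in vanilla multi-head attention from "Attention Is All You Need", `q`, `k`, and `v` each get `heads * dim_head` output features, whereas here only `q` does.

```python
import torch
from torch import nn

dim, heads, dim_head = 512, 8, 64

# vanilla multi-head attention: independent k/v projections per head
to_q_mha = nn.Linear(dim, heads * dim_head, bias=False)
to_k_mha = nn.Linear(dim, heads * dim_head, bias=False)
to_v_mha = nn.Linear(dim, heads * dim_head, bias=False)

# multi-query variant: one shared key/value head
to_q_mqa = nn.Linear(dim, heads * dim_head, bias=False)
to_kv_mqa = nn.Linear(dim, dim_head * 2, bias=False)

x = torch.randn(1, 10, dim)
print(to_k_mha(x).shape)                        # torch.Size([1, 10, 512])
print(to_kv_mqa(x).chunk(2, dim=-1)[0].shape)   # torch.Size([1, 10, 64])
```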