lucidrains / x-transformers

A concise but complete full-attention transformer with a set of promising experimental features from various papers
MIT License

cascading transformer comment typo #191

Closed · p0p4k closed this 1 year ago

p0p4k commented 1 year ago

Sorry if I am misunderstanding the cascading heads paper. In the paper (pdf link), they add the previous head's attention output to the next head's input, $X'_{ij} = X_{ij} + \tilde{X}_{i(j-1)}$ (eq. 3), and that head's input is then mapped into q, k, v through their respective linear layers. But why does this repo's implementation add it only to the query? Is it because the full version would break the kv-cache, or something similar? Sorry if this is too much of a noob question, I am still learning. Thanks.
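For context, here is a minimal sketch of the per-head cascade as I read eq. (3): each head's input is the layer input plus the previous head's attention output, and that summed input is projected to q, k and v. Module and parameter names (`CascadingHeadsAttention`, `head_to_dim`) are hypothetical, and this is not the x-transformers implementation, just an illustration under my reading of the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadingHeadsAttention(nn.Module):
    """Sketch of cascading heads per eq. (3): X'_ij = X_ij + ~X_i(j-1).
    Each head gets its own q/k/v projections because each head sees a
    different (cascaded) input. Hypothetical, for illustration only."""
    def __init__(self, dim, heads = 8):
        super().__init__()
        self.heads = heads
        self.dim_head = dim // heads
        # independent q/k/v projections per head
        self.to_qkv = nn.ModuleList([
            nn.Linear(dim, 3 * self.dim_head, bias = False) for _ in range(heads)
        ])
        # project each head's output back to model dim so it can be added
        # to the next head's input (assumption, to make dimensions line up)
        self.head_to_dim = nn.ModuleList([
            nn.Linear(self.dim_head, dim, bias = False) for _ in range(heads)
        ])
        self.to_out = nn.Linear(heads * self.dim_head, dim, bias = False)

    def forward(self, x):
        outs = []
        prev = torch.zeros_like(x)
        for j in range(self.heads):
            inp = x + prev                                   # X'_ij = X_ij + ~X_i(j-1)
            q, k, v = self.to_qkv[j](inp).chunk(3, dim = -1) # q, k, v all from the cascaded input
            attn = F.softmax(q @ k.transpose(-2, -1) / self.dim_head ** 0.5, dim = -1)
            out = attn @ v                                   # ~X_ij, this head's attention output
            prev = self.head_to_dim[j](out)                  # cascades into the next head's input
            outs.append(out)
        return self.to_out(torch.cat(outs, dim = -1))
```

By contrast, my understanding of the version in this repo (at the time of writing) was that only the query was formed from the cascaded input, while k and v still came from the original x, which is what prompted the question.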

lucidrains commented 1 year ago

@p0p4k ah hey, yea that doesn't look right

i'm going to remove it anyways

p0p4k commented 1 year ago

Okay, got it! Very interesting repo; I am going through it line by line and really like your coding style.

p0p4k commented 1 year ago

Not 100% readable xD, but it is quite "smart" and inspiring. I am taking notes in a Jupyter notebook on the side, which can complement this repo for slow learners like me.