lucidrains / x-transformers

A concise but complete full-attention transformer with a set of promising experimental features from various papers
MIT License

forgetting transformers #281

Open faresobeid opened 3 hours ago

faresobeid commented 3 hours ago

Maybe you could try https://openreview.net/forum?id=q2Lnyegkr8. Also, for the value residual implementation I see you don't average the values, not sure if that's on purpose, and from the paper it seems like resformer + neutreno is best, but I'm unsure how that looks exactly

lucidrains commented 3 hours ago

@faresobeid yes, i was just reading "forgetting transformer" yesterday but haven't fully digested it yet. i'll need two weeks for this paper

oh, was the value residual averaged? and yea, i noticed they had a combo that worked better, but i'd need another hyperparameter for that (to specify resformer only, neutreno only, or both)

faresobeid commented 3 hours ago

Ya, the forgetting transformer is basically mamba 2 style decay but applied to the attention mask, seems cool. For resformer + neutreno, is that just doing both in your code like that, would you say?
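
For concreteness, the way I read it, the forget gate just becomes a data-dependent, lower-triangular bias added to the attention logits (basically a learned, per-token alibi). A rough PyTorch sketch of that reading, with made-up names like `to_forget`, not anything that exists in x-transformers:

```python
import torch
import torch.nn.functional as F
from torch import nn, einsum

class ForgettingAttention(nn.Module):
    def __init__(self, dim, heads = 8, dim_head = 64):
        super().__init__()
        inner_dim = heads * dim_head
        self.heads = heads
        self.scale = dim_head ** -0.5

        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)
        self.to_forget = nn.Linear(dim, heads)   # one forget gate per token per head
        self.to_out = nn.Linear(inner_dim, dim, bias = False)

    def forward(self, x):
        b, n, _, h = *x.shape, self.heads

        q, k, v = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = map(lambda t: t.reshape(b, n, h, -1).transpose(1, 2), (q, k, v))

        # data dependent log forget gates, shape (b, h, n)
        log_f = F.logsigmoid(self.to_forget(x)).transpose(1, 2)

        # cumulative sum, so bias_ij = sum_{k = j+1}^{i} log f_k for j <= i
        cum = log_f.cumsum(dim = -1)
        forget_bias = cum.unsqueeze(-1) - cum.unsqueeze(-2)   # (b, h, i, j)

        sim = einsum('b h i d, b h j d -> b h i j', q, k) * self.scale
        sim = sim + forget_bias

        causal_mask = torch.ones(n, n, dtype = torch.bool, device = x.device).triu(1)
        sim = sim.masked_fill(causal_mask, -torch.finfo(sim.dtype).max)

        out = sim.softmax(dim = -1) @ v
        out = out.transpose(1, 2).reshape(b, n, -1)
        return self.to_out(out)
```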

lucidrains commented 2 hours ago

@faresobeid yea, except it should be more powerful than an associative scan based rnn, as you get a separate row per token

ohh, i see what you mean. they won't be averaged, as the two papers apply the residual in different places
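
roughly how i picture the two residuals going in different places (just a sketch of my reading of the papers; `lambda_resformer`, `lambda_neutreno` and their defaults are made-up knobs, not existing x-transformers options):

```python
import torch

def attend(q, k, v):
    # plain softmax attention (causal masking omitted for brevity)
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1) * scale).softmax(dim = -1)
    return attn @ v

def value_residual_attention(
    q, k, v,                  # q, k, v of the current layer
    first_values,             # values from the very first layer, passed down
    lambda_resformer = 0.5,   # hypothetical knob: 0.5 reproduces the averaging mentioned above
    lambda_neutreno = 0.4     # hypothetical knob for the neutreno term
):
    # resformer applies its residual *before* attention, on the values themselves
    mixed_v = v * (1 - lambda_resformer) + first_values * lambda_resformer

    out = attend(q, k, mixed_v)

    # neutreno applies its residual *after* attention, on the output
    out = out + lambda_neutreno * (first_values - v)

    return out
```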

lucidrains commented 2 hours ago

@faresobeid but yip, i'm open to adding it once i see a positive experiment. think i will for this paper

faresobeid commented 2 hours ago

> @faresobeid yea, except it should be more powerful than an associative scan based rnn, as you get a separate row per token

Hm, not sure what you mean, since mamba 2 decay is data dependent too

faresobeid commented 2 hours ago

> @faresobeid but yip, i'm open to adding it once i see a positive experiment. think i will for this paper

Ya, I mean to be fair it is basically data-dependent alibi, so it must be better than alibi at least

lucidrains commented 2 hours ago

> @faresobeid yea, except it should be more powerful than an associative scan based rnn, as you get a separate row per token

> Hm, not sure what you mean, since mamba 2 decay is data dependent too

in transformers, each token aggregates the other tokens with its own separate forgetting pattern

that isn't the case with an associative scan rnn
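
for contrast, a sketch of the associative scan rnn view: the same kind of data dependent forget gate, but applied to a single compressed state that every future query shares (illustrative only, not from any particular repo):

```python
import torch

def gated_linear_attention_step(state, q, k, v, f):
    # state: (d_k, d_v) matrix memory shared by all future queries
    # q, k: (d_k,)  v: (d_v,)  f: scalar forget gate in (0, 1)
    state = f * state + k.unsqueeze(-1) * v.unsqueeze(-2)  # decay everything, then write
    out = q @ state                                        # every query reads the same decayed state
    return out, state
```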

faresobeid commented 2 hours ago

Well, technically in the attention view they do the same thing here. You can imagine that at inference your whole row gets multiplied by F, then the new token is appended. RNNs do the same, but the issue is that they add these values rather than concatenate them (a trade-off). The forgetting should work the same either way (obviously the forgetting here will learn different patterns, because there's no compression, so it would forget more than if it were used in an RNN).
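
The decode step I'm picturing is roughly this (illustrative only, made-up function and cache layout, not how x-transformers actually caches):

```python
import torch

def decode_step(q, new_k, new_v, log_f_new, cache):
    # q, new_k, new_v: (batch, heads, 1, dim), log_f_new: (batch, heads, 1)
    # cache: (keys, values, accumulated log-forget per cached key)
    k, v, log_decay = cache

    # everything already in the cache gets multiplied by the new forget gate F
    # (an addition in log space), then the new token is appended undecayed
    log_decay = torch.cat((log_decay + log_f_new, torch.zeros_like(log_f_new)), dim = -1)
    k = torch.cat((k, new_k), dim = -2)
    v = torch.cat((v, new_v), dim = -2)

    scale = q.shape[-1] ** -0.5
    sim = (q @ k.transpose(-2, -1)) * scale + log_decay.unsqueeze(-2)
    out = sim.softmax(dim = -1) @ v

    return out, (k, v, log_decay)
```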

faresobeid commented 2 hours ago

On another note, you could also implement this optimizer, to be honest I really like it: https://openreview.net/forum?id=L9eBxTCpQG

lucidrains commented 2 hours ago

> Well, technically in the attention view they do the same thing here. You can imagine that at inference your whole row gets multiplied by F, then the new token is appended. RNNs do the same, but the issue is that they add these values rather than concatenate them (a trade-off). The forgetting should work the same either way (obviously the forgetting here will learn different patterns, because there's no compression, so it would forget more than if it were used in an RNN).

think we may be referring to two different things

agreed the forgetting comes out the same. the difference is each attention row allows a query token to aggregate the past differently

lucidrains commented 2 hours ago

> On another note, you could also implement this optimizer, to be honest I really like it: https://openreview.net/forum?id=L9eBxTCpQG

ah, feel free to pitch this to me over Signal

I usually don't implement too many optimizer papers

lucidrains commented 2 hours ago

@faresobeid oh that optimizer paper already has code submitted (supplementary zip)

faresobeid commented 2 hours ago

> On another note, you could also implement this optimizer, to be honest I really like it: https://openreview.net/forum?id=L9eBxTCpQG

> ah, feel free to pitch this to me over Signal

> I usually don't implement too many optimizer papers

Signal?

faresobeid commented 2 hours ago

> @faresobeid oh that optimizer paper already has code submitted (supplementary zip)

Oh, didn't realise, lol