Open faresobeid opened 3 hours ago
@faresobeid yes, i was just reading "forgetting transformer" yesterday but haven't fully digested it yet. i'll need two weeks for this paper
oh, was value residual averaged? and yea i noticed they had a combo that worked better, but i'd need another hyperparameter for that (to specify resformer only, neutreno only, or both)
Ya the forgetting transformer is basically mamba 2 style decay but on the attn mask, seems cool. For resformer + neutreno, is that just doing both in ur code, would u say?
@faresobeid yea, except it should be more powerful than associative scan based rnn, as you get a separate row per token
ohh, i see what you mean. they won't be averaged, as the two papers apply the residual in different places
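to spell out the "different places" point, here's a rough numpy sketch of my reading of the two papers (single head, no projections; the 0.5 averaging and `lam` are placeholder choices, not the papers' exact parameterizations):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attn(q, k, v):
    # plain causal attention
    n = q.shape[0]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ v

def resformer_block(q, k, v, v_first):
    # resformer: residual applied to the values *before* aggregation
    return attn(q, k, 0.5 * (v + v_first))

def neutreno_block(q, k, v, v_first, lam=0.4):
    # neutreno: residual added to the attention *output*
    return attn(q, k, v) + lam * (v_first - v)
```

so they can't be averaged together: resformer mixes the first-layer values inside the softmax aggregation, while neutreno leaves the aggregation alone and nudges the output toward v_first.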
@faresobeid but yip, i'm open to adding it once i see a positive experiment. think i will for this paper
Hm not sure wym as mamba 2 decay is data dependent
Ya i mean tbf it is basically data dependent alibi, so it must be better than alibi at least
in transformers, each token will aggregate other tokens with their own separate forgetting pattern
that isn't the case with associative scan rnn
Well technically in the attn view they do the same thing here. You can imagine at inference ur whole row gets multiplied by F, then the new token is appended. Rnns do the same, but the issue is they add these values into one state rather than concatenate (a trade-off), so the forgetting should work the same (obv the forgetting here will learn different patterns bcs there's no compression, so it would forget more than if it were used in an rnn)
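fwiw the attn view can be written as a data-dependent additive bias on the logits (my sketch of the idea, not the paper's exact code): with forget gates f_t in (0, 1), the bias on logit (i, j) is the sum of log f_k for j < k <= i, which is exactly alibi-shaped but learned per token:

```python
import numpy as np

def forgetting_bias(f):
    # f: (n,) forget gates in (0, 1), one per token
    L = np.cumsum(np.log(f))          # L[i] = sum_{k<=i} log f[k]
    bias = L[:, None] - L[None, :]    # bias[i, j] = sum_{j<k<=i} log f[k]
    bias[np.triu_indices(len(f), k=1)] = -np.inf  # causal mask
    return bias

def forgetting_attention(q, k, v, f):
    scores = q @ k.T / np.sqrt(q.shape[-1]) + forgetting_bias(f)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

with a constant gate f = c this collapses to (i - j) * log c, i.e. a fixed alibi slope, which is the "data dependent alibi" framing.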
On another note u could also implement this optimizer, tbh i really like it https://openreview.net/forum?id=L9eBxTCpQG
think we may be referring to two different things
agreed the forgetting comes out the same. the difference is each attention row allows a query token to aggregate the past differently
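to illustrate the difference (toy sketch, scalar decay per step, no normalization; not how mamba 2 is actually parameterized): in the scan there is one running state that every query reads, so the decay profile is shared across reads, whereas an attention row i weights the whole history through its own softmax row.

```python
import numpy as np

def scan_rnn(q, k, v, f):
    # mamba-2 style linear attention scan: one (dk, dv) state,
    # decayed by the scalar gate f[t] at every step
    n, dk = k.shape
    S = np.zeros((dk, v.shape[1]))
    outs = []
    for t in range(n):
        S = f[t] * S + np.outer(k[t], v[t])  # history is *added* into one state
        outs.append(q[t] @ S)                # every query at time t reads the same S
    return np.stack(outs)
```

unrolling the recurrence, token j contributes (prod_{j<m<=t} f[m]) * (q_t·k_j) * v_j to output t, i.e. the same decay structure as the forgetting bias, just summed into one compressed state instead of kept as separate rows.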
ah, feel free to pitch this to me over Signal
I usually don't implement too many optimizer papers
@faresobeid oh that optimizer paper already has code submitted (supplementary zip)
Signal?
Oh didnt realise lol
Maybe can try https://openreview.net/forum?id=q2Lnyegkr8. Also for the value residual implementation, I see u don't average the values, not sure if that's on purpose, and from the paper it seems like resformer + neutreno is best, but im unsure how that looks exactly
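my guess at how the hyperparameter for "resformer only / neutreno only / both" could look (a sketch only; `value_residual` and `mode` are names i made up, and the 0.5 mix and `lam` are placeholders, not tuned values):

```python
import numpy as np

def attend(q, k, v):
    # plain causal attention
    n = q.shape[0]
    s = q @ k.T / np.sqrt(q.shape[-1])
    s[np.triu_indices(n, k=1)] = -np.inf
    s -= s.max(axis=-1, keepdims=True)
    w = np.exp(s)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def value_residual(q, k, v, v_first, mode='both', lam=0.4):
    # 'resformer' mixes first-layer values into v before aggregation,
    # 'neutreno' adds lam * (v_first - v) to the output, 'both' does both
    v_in = 0.5 * (v + v_first) if mode in ('resformer', 'both') else v
    out = attend(q, k, v_in)
    if mode in ('neutreno', 'both'):
        out = out + lam * (v_first - v)
    return out
```

since the two residuals touch different places (values vs output), "both" composes cleanly without extra interaction terms.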