lucidrains / taylor-series-linear-attention

Explorations into the recently proposed Taylor Series Linear Attention
MIT License

Replicating Results? #2

Open fattorib opened 7 months ago

fattorib commented 7 months ago

Thank you for the code! I've been using it as a reference for my own implementation. Have you replicated the results in the original blogpost? Based on your update in the readme, it seems like you have.

I'm asking since in my experiments up to ~125M params, the model still falls short of a standard Transformer + RoPE on the Pile. The exponential linear attention is slightly better than naive linear attention, but only barely.

Any experimental setup/model details you have would be appreciated!

Thanks :)

lucidrains commented 7 months ago

@fattorib ah hey Ben! that's cool you are interested in linear attention too

yea, i don't expect this work to immediately help with language modeling just yet. i'm using it in the context of equivariant networks [1] [2], where i saw a nice improvement over the other types of linear attention i usually employ for supplementing local full attention

lucidrains commented 7 months ago

@fattorib the big issue is that the taylor expansion caps the head dimension at ~16, while language modeling usually requires 64, 128, or 256. they had to add gated convs to help offset this loss. what head dimensions are you using in your experiments?
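
For context on why the head dimension is capped: the second-order Taylor expansion exp(q·k) ≈ 1 + q·k + (q·k)²/2 is realized with a feature map of dimension 1 + d + d², so the state grows quadratically in head dimension. Below is a minimal sketch of such a feature map (the exact scaling and normalization in this repo may differ):

```python
import torch

def taylor_feature_map(x):
    # second-order Taylor feature map, chosen so that
    # phi(q) . phi(k) ~= 1 + q.k + (q.k)^2 / 2
    # expanded feature dimension is 1 + d + d^2
    ones = torch.ones((*x.shape[:-1], 1), device=x.device, dtype=x.dtype)
    first = x
    second = torch.einsum('... i, ... j -> ... i j', x, x).flatten(-2) / (2 ** 0.5)
    return torch.cat((ones, first, second), dim=-1)

q = torch.randn(2, 8, 128, 16)          # (batch, heads, seq, head_dim = 16)
print(taylor_feature_map(q).shape[-1])  # 1 + 16 + 256 = 273
# at head_dim = 64 the expanded dimension would already be 1 + 64 + 4096 = 4161
```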

lucidrains commented 7 months ago

@fattorib could you possibly share your wandb reports? i'm most curious about your run comparing naive to taylor linear attention and what their hparams are

fattorib commented 7 months ago

@lucidrains Thanks for the detailed response. Currently using a head dim of 16. I also tried bumping the value head dimension up to 64, but no luck there. Even when interleaving the GatedConv blocks they introduce, the performance still isn't up to full attention. I'll withhold judgement until they release more code/results - right now, reading through the blogpost and code, the overall architecture seems to be underspecified.

Just getting some new runs for linear attention and I'll share the Wandb here when theyre done!

lucidrains commented 7 months ago

@fattorib yea, i don't expect the value head dimension to help at all, but what you could try is increasing the number of heads. no matter what you fiddle with, a head dimension of 16 is just too underpowered. even for vision, head dimensions usually only go as low as 32

lucidrains commented 7 months ago
[screenshot: training loss curves comparing attention variants, 2024-02-05]

i know this means nothing to you, but basically red is old linear attention, purple is taylor, and blue is the baseline (without linear attention). i've been struggling with different types of old linear attention (katharopoulos, performer) since forever. the improvements were never consistent, and sometimes adding it even harmed the final results. so when i saw the purple line it was encouraging! however, there could still be a chance that i have some bug in my implementation 😆, be sure to give their official code a try

lucidrains commented 7 months ago

@fattorib have you tried turning on token shifting (from rwkv)? i've found it helps a lot for linear autoregressive attention
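
In case it helps anyone reading along, here's a minimal sketch of one common token-shift variant (the exact form in RWKV, and in this repo's implementation, may differ): half the channels are shifted back by one timestep so each token also sees its predecessor's features.

```python
import torch
import torch.nn.functional as F

def token_shift(x):
    # x: (batch, seq, dim)
    x, x_shift = x.chunk(2, dim=-1)
    # shift the second half of the channels one step into the past:
    # pad one zero step at the front of the sequence and drop the last step
    x_shift = F.pad(x_shift, (0, 0, 1, -1))
    return torch.cat((x, x_shift), dim=-1)

x = torch.randn(1, 1024, 512)
assert token_shift(x).shape == x.shape
```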

fattorib commented 7 months ago

@lucidrains I also tried scaling the number of heads from 12 to 50, without any major performance improvement... I'll give the token-shifting a try too.

fattorib commented 7 months ago

Sharing a WandB report comparing Taylor vs Linear vs Softmax Attention here.

I also have a second report where I tried increasing the number of heads to 48 on a smaller ~80M param model. This setup actually seems to perform closer to Softmax attention, but still falls a bit short.

lucidrains commented 7 months ago

thank you Ben, this is awesome!

lucidrains commented 7 months ago

beating full attention isn't the goal for me; I was looking for an improvement over old linear attention in special circumstances. I understand why you must have been disappointed

lucidrains commented 7 months ago

if you want to keep on the language modeling path, might I suggest interleaving with gateloop layers? your network will remain fully recurrent at inference time https://github.com/lucidrains/gateloop-transformer
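
A rough sketch of the interleaving pattern being suggested (the module classes here are placeholders supplied by the caller, not the actual classes from the linked repo):

```python
from torch import nn

def interleaved_blocks(dim, depth, GateLoopLayer, TaylorLinearAttention, FeedForward):
    # alternate gateloop-style layers with taylor linear attention layers,
    # keeping a feedforward after every token-mixing layer
    blocks = nn.ModuleList([])
    for layer_index in range(depth):
        mixer = GateLoopLayer(dim) if layer_index % 2 == 0 else TaylorLinearAttention(dim)
        blocks.append(nn.ModuleList([mixer, FeedForward(dim)]))
    return blocks
```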

fattorib commented 7 months ago

Thank you! I'll check the repo out.

lucidrains commented 7 months ago

@fattorib for your first experiment, how many heads do taylor and old linear attention have, respectively?

fattorib commented 7 months ago

Both have 12 heads

lucidrains commented 7 months ago

thank you!

Doraemonzzz commented 7 months ago

Hello, are there any experiments on short convolutions? In my experiments, short convolutions had no effect at all.

fattorib commented 7 months ago

All the Taylor-exp attention experiments in the wandb report I shared use the BaseConv in place of attention in every second Transformer block. This aligns with the Zoology repo in terms of architecture. I agree, adding them had almost no effect.
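
For readers who haven't looked at the Zoology repo, here is a minimal sketch of the kind of gated short-convolution block being described (the structure and naming are my assumptions, not the actual BaseConv code):

```python
import torch
from torch import nn

class GatedShortConv(nn.Module):
    # gated short causal convolution: out = proj_out( gate(x) * causal_conv(proj_in(x)) )
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.proj_in = nn.Linear(dim, dim)
        self.proj_gate = nn.Linear(dim, dim)
        # depthwise conv with left padding so the convolution stays causal
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim, padding=kernel_size - 1)
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, seq, dim)
        seq_len = x.shape[1]
        conv_branch = self.conv(self.proj_in(x).transpose(1, 2))[..., :seq_len].transpose(1, 2)
        return self.proj_out(self.proj_gate(x) * conv_branch)

block = GatedShortConv(dim=512)
out = block(torch.randn(1, 1024, 512))  # (1, 1024, 512)
```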

lucidrains commented 7 months ago

@fattorib yea, the convs were the least interesting part of their architecture

didn't expect much from it. try the gateloop + linear attention though (keeping the feedforwards)

Doraemonzzz commented 7 months ago

> All the Taylor-exp attention experiments in the wandb report I shared use the BaseConv in place of attention in every second Transformer block. This aligns with the Zoology repo in terms of architecture. I agree, adding them had almost no effect.

Thank you for providing the experimental results. What do you think could be the underlying cause behind this?

Doraemonzzz commented 7 months ago

> @fattorib yea, the convs were the least interesting part of their architecture
>
> didn't expect much from it. try the gateloop + linear attention though (keeping the feedforwards)

Based on my experience with TransNormer and similar architectures like GLA, it seems to work.

Doraemonzzz commented 7 months ago

My biggest question about short convs is: if they don't work at all, why do Base, Mamba, and RWKV all adopt this operation?

lucidrains commented 7 months ago

@Doraemonzzz my advice is, unless you doubt your own experimental techniques, always trust what you see with your own eyes over the claims of a paper

Doraemonzzz commented 7 months ago

> @Doraemonzzz my advice is, unless you doubt your own experimental techniques, always trust what you see with your own eyes over the claims of a paper

Thank you for the suggestion. I was just worried it might be a problem with my implementation, so having peers reach the same conclusion is reassuring.