lucidrains / ring-attention-pytorch

Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch

Zigzag ring attention support? #20

Open dwromero opened 3 hours ago

dwromero commented 3 hours ago

Hi @lucidrains ,

I hope you are doing well. And thank you for yet another useful repo! :)

I was wondering if you have any plans to support the zigzag version of ring attention. It seems to balance compute across devices better in autoregressive settings and is quite hot at the moment (https://github.com/zhuzilin/ring-flash-attention/issues/2). I'd be happy to help with that if needed.

David

lucidrains commented 3 hours ago

hey David, no problem

could you link me to the paper?

did you see the rotation trick from Chris Fifty yet?

dwromero commented 2 hours ago

Hi,

could you link me to the paper?

-> It's used in the Llama 3 paper (https://arxiv.org/abs/2407.21783); see page 11, in the section on context parallelism. Note that they don't actually use the form of ring attention implemented here, for GQA and attention-masking reasons.

did you see the rotation trick from Chris Fifty yet?

-> I have not. What is it about?

lucidrains commented 2 hours ago

check out the vq repo

nice! didn't even know Meta was using ring attention 🤣 I'll read the paper tomorrow

lucidrains commented 2 hours ago

guess all the big players will be using some form of sequence parallel attention soon (google, meta, and you at nvidia)

lucidrains commented 1 hour ago

@dwromero could i prompt you for a summary of what zigzag is? is it just another way to permute the sequence for better balancing?

dwromero commented 33 minutes ago

That's right
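
To make it concrete, here's a minimal sketch of the chunk assignment used by the zigzag layout (following the scheme in zhuzilin/ring-flash-attention); the helper name and tensor shapes are just illustrative, not anything from this repo:

```python
import torch

def zigzag_split(seq: torch.Tensor, rank: int, world_size: int, dim: int = 1) -> torch.Tensor:
    """Return the two sequence chunks that `rank` keeps under the zigzag layout."""
    # split the sequence into 2 * world_size equal chunks along `dim`
    # (assumes the sequence length is divisible by 2 * world_size)
    chunks = seq.chunk(2 * world_size, dim = dim)

    # rank i keeps chunk i (early positions, cheap under a causal mask) and
    # chunk 2 * world_size - 1 - i (late positions, expensive), so every rank
    # ends up with roughly the same amount of causal-attention work
    return torch.cat((chunks[rank], chunks[2 * world_size - 1 - rank]), dim = dim)

# example: 4 ranks -> 8 chunks; rank 0 gets chunks (0, 7), rank 1 gets (1, 6), ...
x = torch.arange(16).reshape(1, 16, 1)  # (batch, seq, dim)
print(zigzag_split(x, rank = 0, world_size = 4).flatten().tolist())  # [0, 1, 14, 15]
```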

lucidrains commented 29 minutes ago

@dwromero ok, should be an easy add!

dwromero commented 26 minutes ago

🤟🤟🤟