dwromero opened 3 hours ago

Hi @lucidrains,

I hope you are doing well. And thank you for yet another useful repo! :)

I was wondering if you have any plans to support the zigzag version of ring attention. It seems to distribute compute better across devices in autoregressive settings and is quite hot at the moment (https://github.com/zhuzilin/ring-flash-attention/issues/2). I'm happy to help if you'd like a hand with it.

David
hey David, no problem
could you link me to the paper?
did you see the rotation trick from Chris Fifty yet?
Hi,
could you link me to the paper?
-> It's used in the Llama 3 paper (https://arxiv.org/abs/2407.21783), page 11, in the section on context parallelism. They don't actually use the form of ring attention implemented here, though, for GQA and attention-masking reasons.
did you see the rotation trick from Chris Fifty yet?
-> I have not. What is it about?
check out the vq repo
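Context for readers: the "vq repo" is presumably lucidrains/vector-quantize-pytorch. The rotation trick, from Fifty et al.'s "Restructuring Vector Quantization with the Rotation Trick", replaces the straight-through estimator in vector quantization: rather than copying gradients through the codebook lookup with `q = e + (q - e).detach()`, the output is produced by a rotation-plus-rescale of the encoder output `e` that lands exactly on the code `q`, with the transform held constant, so gradients flow through `e` via that linear map. A rough sketch, with illustrative names (not the repo's actual API):

```python
import torch
import torch.nn.functional as F

def rotate_to(e: torch.Tensor, q: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # directions and scale are detached, so during backprop the whole
    # transform acts as a fixed linear map applied to e
    e_hat = F.normalize(e, dim=-1, eps=eps).detach()
    q_hat = F.normalize(q, dim=-1, eps=eps).detach()
    r = F.normalize(e_hat + q_hat, dim=-1, eps=eps)  # Householder direction

    scale = (q.norm(dim=-1, keepdim=True) /
             e.norm(dim=-1, keepdim=True).clamp(min=eps)).detach()

    # apply (I - 2 r r^T + 2 q_hat e_hat^T) to e, then rescale: the forward
    # value equals q exactly, but the gradient w.r.t. e is this linear map
    # rather than the identity of the straight-through estimator
    rotated = (e
               - 2 * (e * r).sum(dim=-1, keepdim=True) * r
               + 2 * (e * e_hat).sum(dim=-1, keepdim=True) * q_hat)
    return scale * rotated

e = torch.randn(2, 8, requires_grad=True)
q = torch.randn(2, 8)
assert torch.allclose(rotate_to(e, q), q, atol=1e-4)  # forward output is q itself
```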
nice! didn't even know Meta was using ring attention 🤣 I'll read the paper tomorrow
guess all the big players will be using some form of sequence parallel attention soon (google, meta, and you at nvidia)
@dwromero could i prompt you for a summary of what zigzag is? is it just another way to permute the sequence for better balancing?
That's right. With a causal mask, a plain contiguous split is unbalanced: the device holding the first chunk has almost no keys to attend over, while the device holding the last chunk has all of them. Zigzag pairs each early chunk with a late one so every device ends up with roughly the same amount of attention compute.
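To make that concrete, here is a minimal sketch of the layout, following the scheme in zhuzilin/ring-flash-attention (the function names and the even-divisibility assumption are mine, not from this repo): for n devices, the sequence is cut into 2n chunks and device i holds chunks i and 2n - 1 - i.

```python
import torch

def zigzag_shard(seq, num_devices):
    # cut into 2n chunks (assumes seq length divisible by 2 * num_devices);
    # device i gets chunks i and 2n - 1 - i, pairing a cheap early chunk
    # with an expensive late chunk under a causal mask
    chunks = seq.chunk(2 * num_devices, dim=1)
    return [torch.cat((chunks[i], chunks[2 * num_devices - 1 - i]), dim=1)
            for i in range(num_devices)]

def zigzag_unshard(shards):
    # inverse: split each shard back into its two chunks, restore order
    n = len(shards)
    halves = [s.chunk(2, dim=1) for s in shards]
    front = [h[0] for h in halves]                    # chunks 0 .. n-1
    back = [halves[n - 1 - i][1] for i in range(n)]   # chunks n .. 2n-1
    return torch.cat(front + back, dim=1)

x = torch.arange(16.).reshape(1, 16, 1)    # (batch, seq, dim)
shards = zigzag_shard(x, num_devices=4)    # shard 0 holds positions 0-1 and 14-15
assert torch.equal(zigzag_unshard(shards), x)
```

The causal masking inside the ring passes has to match this layout too, but the permutation above is the core idea.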
@dwromero ok, should be an easy add!
🤟🤟🤟