lucidrains / ring-attention-pytorch

Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch

Comment about use of all gather #1

Closed: NielsRogge closed this issue 7 months ago

NielsRogge commented 7 months ago

Hi Phil!

Hope you're doing well. As you saw with Gemini Pro 1.5, which handles 1 million tokens of context, open source has some catching up to do :D Porting Ring Attention to PyTorch is definitely a first step towards that.

@rwightman made an interesting comment on your current approach to implementing Ring Attention; I thought it would be useful to share it with you: https://twitter.com/wightmanr/status/1758275957557719308. Basically, Ross had to implement something similar to make the SigLIP loss function work, leveraging neighbour exchange instead of all_gather.
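For context, here is a minimal, hypothetical sketch of that kind of neighbour exchange using torch.distributed point-to-point ops (the function name and shapes are illustrative, not Ross's or this repo's actual code): instead of materialising every rank's tensor with all_gather, each rank only swaps one block with its ring neighbours per step.

```python
# Hypothetical neighbour-exchange sketch, not taken from open_clip or this repo.
# Each rank sends its tensor to the next rank in the ring and receives the
# previous rank's tensor, so only one block is communicated per step.

import torch
import torch.distributed as dist

def neighbour_exchange(tensor: torch.Tensor) -> torch.Tensor:
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    send_rank = (rank + 1) % world_size
    recv_rank = (rank - 1) % world_size

    recv_buffer = torch.empty_like(tensor)

    # issue the send and receive together to avoid deadlock
    send_op = dist.P2POp(dist.isend, tensor, send_rank)
    recv_op = dist.P2POp(dist.irecv, recv_buffer, recv_rank)
    for req in dist.batch_isend_irecv([send_op, recv_op]):
        req.wait()

    return recv_buffer
```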

Btw, once your implementation is done, I would like to leverage it to port the LWM model that came out two days ago (https://github.com/LargeWorldModel/LWM). I would port the model to the Hugging Face Transformers library by adding an LWMForCausalLM class. Since the weights are open-sourced, I can convert them to the Transformers format.

Btw are you still active on any Discord channel?

Cheers,

Niels

lucidrains commented 7 months ago

hey Niels! good to hear from you and hope you have been well

Ross is reading a repo that is not done yet. i was not planning on using all gather for the ring reduce portion, if that is what he is critiquing. nonetheless, i think Ross' skillset is better suited for this type of work, and i welcome a critique after completion. also, it was brought to my attention yesterday that deepspeed ulysses may already have something similar, so we should look into that
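to make the distinction concrete, below is a rough, hypothetical sketch of what a ring-reduce pass could look like without all_gather, reusing the neighbour_exchange helper sketched earlier in this thread. it is not this repository's actual implementation and omits causal masking, multiple heads, and gradient checkpointing: queries stay on their rank while key/value blocks rotate one hop per step, and partial attention results are merged with a running log-sum-exp.

```python
# Illustrative sketch only, assuming an initialized process group and the
# neighbour_exchange helper sketched above. k and v hold this rank's local
# key/value block; they rotate around the ring once per iteration.

import torch

def ring_attention_sketch(q, k, v, world_size):
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full(q.shape[:-1] + (1,), float('-inf'), device=q.device)
    row_sum = torch.zeros(q.shape[:-1] + (1,), device=q.device)

    for _ in range(world_size):
        sim = (q @ k.transpose(-2, -1)) * scale            # (..., q_len, k_len)

        # numerically stable streaming softmax over the blocks seen so far
        block_max = sim.amax(dim=-1, keepdim=True)
        new_max = torch.maximum(row_max, block_max)
        exp_sim = (sim - new_max).exp()
        correction = (row_max - new_max).exp()

        row_sum = row_sum * correction + exp_sim.sum(dim=-1, keepdim=True)
        out = out * correction + exp_sim @ v
        row_max = new_max

        # pass this rank's key/value block to the next rank in the ring
        k = neighbour_exchange(k)
        v = neighbour_exchange(v)

    return out / row_sum
```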

you can best reach me through Signal these days if you want to chat! email me for the phone number

lucidrains commented 7 months ago

@NielsRogge where are you working now Niels? i thought you left huggingface for a while